**The Information Retrieval Series**

Andrea Esuli Alessandro Fabris Alejandro Moreo Fabrizio Sebastiani

# Learning to Quantify

# **The Information Retrieval Series**

Volume 47

# **Series Editors**

ChengXiang Zhai, University of Illinois, Urbana, IL, USA Maarten de Rijke, University of Amsterdam, The Netherlands and Ahold Delhaize, Zaandam, The Netherlands

# **Editorial Board Members**

Nicholas J. Belkin, Rutgers University, New Brunswick, NJ, USA Charles Clarke, University of Waterloo, Waterloo, ON, Canada Diane Kelly, University of Tennessee at Knoxville, Knoxville, TN, USA Fabrizio Sebastiani, Consiglio Nazionale delle Ricerche, Pisa, Italy

Information Retrieval (IR) deals with access to and search in mostly unstructured information, in text, audio, and/or video, either from one large file or spread over separate and diverse sources, in static storage devices as well as on streaming data. It is part of both computer and information science, and uses techniques from e.g. mathematics, statistics, machine learning, database management, or computational linguistics. Information Retrieval is often at the core of networked applications, web-based data management, or large-scale data analysis.

The Information Retrieval Series presents monographs, edited collections, and advanced text books on topics of interest for researchers in academia and industry alike. Its focus is on the timely publication of state-of-the-art results at the forefront of research and on theoretical foundations necessary to develop a deeper understanding of methods and approaches.

This series is abstracted/indexed in EI Compendex and Scopus.


Andrea Esuli Istituto di Scienza e Tecnologie dell'Informazione Consiglio Nazionale delle Ricerche Pisa, Italy

Alejandro Moreo Istituto di Scienza e Tecnologie dell'Informazione Consiglio Nazionale delle Ricerche Pisa, Italy

Alessandro Fabris Dipartimento di Ingegneria dell'Informazione Università di Padova Padova, Italy

Fabrizio Sebastiani Istituto di Scienza e Tecnologie dell'Informazione Consiglio Nazionale delle Ricerche Pisa, Italy

This work was supported by Istituto di Scienza e Tecnologie dell'Informazione

ISSN 1871-7500 ISSN 2730-6836 (electronic) The Information Retrieval Series ISBN 978-3-031-20466-1 ISBN 978-3-031-20467-8 (eBook) https://doi.org/10.1007/978-3-031-20467-8

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*Policy makers or computer scientists may be interested in finding the needle in the haystack (. . . ), but social scientists are more commonly interested in characterizing the haystack.*

(Daniel J. Hopkins and Gary King, 2010)

# **Preface**

In a number of applications involving classification, the final goal is not determining which class (or classes) individual unlabelled instances belong to, but estimating the *prevalence* (or "relative frequency", or "prior probability") of each class in the unlabelled data. In recent years it has been pointed out that, in these cases, it would make sense to directly optimise machine learning algorithms for this goal, rather than (somehow indirectly) just optimising the classifier's ability to label individual instances. The task of training estimators of class prevalence via supervised learning is known as *learning to quantify*, or, more simply, *quantification*. It is by now well known that performing quantification by classifying each unlabelled instance via a standard classifier and then counting the instances that have been assigned to the class (the *Classify and Count* method) usually leads to biased estimators of class prevalence, i.e., to poor quantification accuracy; as a result, methods (and evaluation measures) that address quantification as a task in its own right have been developed. This book covers the main applications of quantification, the main methods that have been developed for learning to quantify, the measures that have been adopted for evaluating it, and the challenges that still need to be addressed by future research.

The book is divided into seven chapters. Chapter 1 sets the stage for the rest of the book by introducing fundamental notions such as class distributions, their estimation, and dataset shift, by arguing for the suboptimality of using classification techniques for performing this estimation, and by discussing why learning to quantify has evolved as a task of its own, rather than remaining a by-product of classification. Chapter 2 provides the motivation for what is to come by describing the applications that quantification has been put to, ranging from improving classification accuracy in domain adaptation, to measuring and improving the fairness of classification systems with respect to a sensitive attribute, to supporting research and development in the social sciences, in political science, epidemiology, market research, and others. In Chapter 3 we move on to discuss the experimental evaluation of quantification systems; we look at evaluation measures for the various types of quantification systems (binary, single-label multiclass, multi-label multiclass, ordinal), but also at evaluation protocols for quantification, which essentially consist of ways to extract multiple testing samples for use in quantification evaluation from a single classification test set. Chapter 4 is possibly the central chapter of the book, and looks at the various supervised learning methods for learning to quantify that have been proposed over the years, be they of an aggregative nature (i.e., methods that require the classification of all individual unlabelled items as an intermediate step) or of a non-aggregative nature (i.e., methods in which no classification of individual items is performed). In Chapter 5 we look at a number of "advanced" (or niche) topics in quantification, including quantification for ordinal data, cross-lingual quantification of textual items, quantification for networked data, and quantification for streaming data.
Chapter 6 looks at other aspects of the "quantification landscape" that have not been covered in the previous chapters, and discusses the evolution of quantification research, from its beginnings to the most recent quantification-based "shared tasks", the landscape of quantification-based, publicly available software libraries, and other tasks in data science that present important similarities with quantification. Chapter 6 also presents the results of experiments, that we have carried out ourselves, in which we compare many of the methods discussed in Chapter 4 on a common testing infrastructure. Chapter 7 concludes the book, pointing to potential future developments in the quantification arena.

The book is mostly addressed to researchers in data science that might want to come up to speed with the state of the art in learning to quantify, but it can be useful also to researchers and scientists that operate in other disciplines and that apply techniques from data science to their own application domains. Indeed, it is our experience that many potential users of quantification techniques (who operate in the fields touched upon in Chapter 2, and possibly in others too) do not use them, thus settling for suboptimal "classify and count" techniques, for the simple fact that they are not aware of their existence, and of the existence of quantification as a task of its own; it is also those potential users that we hope will be inspired by this book.

We thus hope that the availability of a book that surveys all aspects of the quantification workflow and presents them in a hopefully accessible form, will increase the interest in this subject on the part of researchers and practitioners alike, and will contribute to making quantification better known to potential users of this technology and to researchers interested in advancing the field.

Pisa, Italy Andrea Esuli Padova, Italy Alessandro Fabris Pisa, Italy Alejandro Moreo Pisa, Italy Fabrizio Sebastiani

# **Acknowledgments**

The work of Andrea Esuli, Alejandro Moreo, and Fabrizio Sebastiani has been supported by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData.it and FAIR projects, funded by the Italian Ministry of University and Research under the NextGenerationEU program. The authors' opinions do not necessarily reflect those of the European Commission. The work by Alessandro Fabris was supported by MIUR (Italian Ministry for University and Research) under the "Departments of Excellence" initiative (Law 232/2016).


# **Chapter 1 The Case for Quantification**

Classification, perhaps the most fundamental among the tasks addressed by supervised machine learning, has to do with assigning one or more classes from a predefined set to each data item from a given distribution. Over the last 50 years or more, classification has been extensively studied, not only in machine learning but also in philosophy, content analysis, statistics, and other branches of science.

About fifteen years ago, in a seminal paper, Forman (2005) observed that, in several applications involving classification, the final goal is not determining which class (or classes) individual unlabelled data items belong to, but estimating the *prevalence* (also called "relative frequency", or "prior probability", or simply "prior") of each class in the unlabelled data. Training class prevalence estimators via supervised learning has come to be known as *quantification*, a term coined by Forman (2005, 2006, 2008) which has stuck from then on; the term *learning to quantify* is also used, which stresses the fact that prevalence estimation is, in this case, tackled by means of supervised learning.

To see the importance of learning to quantify, let us examine the task of classifying textual answers returned to open-ended questions in questionnaires (Esuli and Sebastiani, 2010b), and let us discuss two important such scenarios.

In the first scenario, a telecommunications company asks its current customers the question "How satisfied are you with our mobile phone services?", and has its information scientists classify each resulting textual answer into one of a set of classes of interest. One of the goals of this survey is to know which of the resulting textual answers belong to class MayDefectToCompetition. The company is likely interested in accurately classifying each individual customer, since it may want to call each customer that is assigned the class MayDefectToCompetition and offer her improved conditions, so as not to lose her as a customer.

In the second scenario, a market research expert, working for a fast food company, asks respondents the question "What do you think of onions in cheeseburgers?", and wants to know which of the resulting textual answers belong to class LikesOnionsInCheeseburgers. Here, the market research expert is presumably *not* interested in whether a specific individual belongs to the class, but is likely interested in knowing *how many* respondents, out of the total number of respondents, belong to it, i.e., in knowing the prevalence of the class.

In sum, while in the former scenario the interest is at the individual level, in the latter the aggregate level is all that matters; in other words, in the former scenario classification is the goal, while in the latter the real goal is quantification.

Other tasks in which "individuals do not matter", i.e., in which the classes to which individuals belong are useful only inasmuch as they allow us to obtain indicators concerning the entire population, are, e.g., predicting election results by estimating the prevalence of blog posts (or tweets) supporting a given candidate or party (Hopkins and King, 2010), or planning the amount of human resources to allocate to different types of issues in a customer support centre by estimating the prevalence of customer calls related to each issue (Forman, 2005), or supporting epidemiological research by estimating the prevalence of medical reports where a specific pathology is diagnosed (Baccianella et al., 2013). Indeed, there are entire fields of human inquiry which are devoted to studying phenomena only at a collective level; examples of such fields are market research, political science, the social sciences, ecological modelling, and epidemiology. When researchers in these fields are confronted with unlabelled data and the need to label them, they usually need quantification, and not classification.

Note that, also due to the variety of fields in which it has emerged as an application need, quantification goes under different names in different areas of science and in different scientific papers. It has variously been called *counting* (Lewis, 1995), *class probability re-estimation* (Alaíz-Rodríguez et al., 2011), *class prior estimation* (Chan and Ng, 2006; Zhang and Zhou, 2010), and *class distribution estimation* (González-Castro et al., 2013; Limsetto and Waiyamai, 2011; Xue and Weiss, 2009).

## **1.1 Class Distributions and Their Estimation**

An example quantification task is displayed, via a histogram, in Figure 1.1. The example involves a number of textual product reviews labelled according to a set of five classes (from VeryNegative to VeryPositive) representing "scores" assigned to the reviewed products. In the histogram, the blue bars represent the true (unknown) class prevalence values that need to be estimated (i.e., the fractions of product reviews that have been assigned the scores indicated), and the red bars represent the corresponding estimates obtained by a quantification method. When the blue bars are identical to the corresponding red bars, the estimation is perfectly accurate. Since all the fractions are in the [0,1] interval and sum up to 1, we are here in the presence of two *probability distributions*. This shows that learning to quantify may be also defined as the task of learning to approximate an unknown *true distribution* by a *predicted distribution*. (In the case of Figure 1.1 we are actually in the presence of two *ordinal* distributions, since there are more than two classes and there is an implied total order on them; see Sections 3.2 and 5.1 for more on ordinal distributions and their estimation.) As a result, and as we will see more thoroughly in Section 3, practically all evaluation measures for quantification are *divergences*, i.e., measures of how a predicted distribution "diverges" from the true distribution. This justifies the fact that, as previously hinted, quantification is sometimes called "class distribution estimation" (González-Castro et al., 2013; Limsetto and Waiyamai, 2011; Xue and Weiss, 2009).

**Fig. 1.1** An example quantification task; blue bars represent the unknown true class prevalence values that need to be estimated, and red bars represent their estimates obtained by a quantification method.
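The idea of measuring how far a predicted distribution is from the true one can be illustrated with a minimal sketch. The prevalence values below are invented for illustration (they are not read off Figure 1.1), and the measure used, the mean absolute difference between corresponding prevalence values, is simply one easy-to-compute choice made for this example; the names `is_distribution` and `absolute_error` are hypothetical.

```python
# Hypothetical prevalence values for a five-class task like the one in Figure 1.1.
CLASSES = ["VeryNegative", "Negative", "Neutral", "Positive", "VeryPositive"]
true_prev = [0.10, 0.15, 0.20, 0.30, 0.25]   # the "blue bars" (assumed values)
pred_prev = [0.12, 0.13, 0.25, 0.28, 0.22]   # the "red bars" (assumed values)


def is_distribution(p, tol=1e-9):
    """Check that all values lie in [0, 1] and that they sum up to 1."""
    return all(0.0 <= v <= 1.0 for v in p) and abs(sum(p) - 1.0) <= tol


def absolute_error(p, p_hat):
    """Mean absolute difference between true and predicted prevalence values."""
    if len(p) != len(p_hat):
        raise ValueError("the two distributions must cover the same classes")
    return sum(abs(a - b) for a, b in zip(p, p_hat)) / len(p)


error = absolute_error(true_prev, pred_prev)
for name, a, b in zip(CLASSES, true_prev, pred_prev):
    print(f"{name:>12}: true={a:.2f} predicted={b:.2f}")
print(f"absolute error: {error:.3f}")
```

The measure is 0 exactly when the two distributions coincide, which corresponds to the case in which the blue and red bars are identical.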

## **1.2 The Suboptimality of** *Classify and Count*

In the absence of methods for estimating class prevalence values more directly, the obvious method for doing it is *Classify and Count*, i.e., classifying each unlabelled data item and estimating class prevalence values by counting the items that have been assigned to each class.
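As a minimal sketch of this method, the function below applies a hard classifier to every item and returns the fraction of items assigned to each class. The function name `classify_and_count`, the keyword-based `toy_classifier`, and the example reviews are all assumptions made for illustration.

```python
from collections import Counter


def classify_and_count(classifier, items, classes):
    """Estimate class prevalence by classifying each item and counting."""
    if not items:
        raise ValueError("the sample must be non-empty")
    counts = Counter(classifier(x) for x in items)
    return {y: counts[y] / len(items) for y in classes}


def toy_classifier(text):
    """Hypothetical stand-in for a trained classifier."""
    return "Positive" if "good" in text.lower() else "Negative"


sample = ["Good value", "Bad battery", "good screen", "Too slow"]
estimate = classify_and_count(toy_classifier, sample, ["Positive", "Negative"])
print(estimate)
```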

However, this strategy is sub-optimal: while a perfect classifier is also, quite obviously, a perfect *quantifier* (i.e., estimator of class prevalence values), a good classifier may be a bad quantifier. To see this, one only needs to look at the definition of *F*1, a standard evaluation function for binary classification, which is defined as

$$F_1 = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}\tag{1.1}$$

where TP, FP, FN indicate the numbers of true positives, false positives, and false negatives, respectively, in a binary contingency table. According to $F_1$, a binary classifier $h_1$ for which FP = 10 and FN = 10 is worse than a classifier $h_2$ for which, on the same test set, FP = 8 and FN = 10. However, when using "classify and count", $h_1$ is intuitively a better binary quantifier than $h_2$; indeed, $h_1$ is (on this test set) a perfect estimator of class prevalence values, since FP and FN are equal and thus compensate each other, so that the distribution of the unlabelled items across the class and its complement is estimated perfectly. That a good classifier may be a bad quantifier can be seen by the fact that, as evident from Equation 1.1, $F_1$ considers "good" those classifiers that keep the sum (FP + FN) to a minimum; however, the goal of a quantification algorithm must be that of keeping to a minimum |FP − FN|, and not (FP + FN).
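The comparison between the two classifiers can be checked numerically. Under "classify and count" the estimated number of positives is TP + FP while the true number is TP + FN, so the error on the prevalence of the positive class is |FP − FN| divided by the size of the test set. The sketch below assumes a hypothetical test set of 1,000 items, 100 of which are positive (the text fixes only FP and FN, so TP = 90 follows from this assumption).

```python
def f1(tp, fp, fn):
    """F1 as defined in Equation 1.1."""
    return 2 * tp / (2 * tp + fp + fn)


def cc_prevalence_error(fp, fn, n):
    """Absolute error of classify-and-count on the positive class.

    Estimated prevalence is (TP + FP) / n, true prevalence is (TP + FN) / n,
    so the difference reduces to |FP - FN| / n.
    """
    return abs(fp - fn) / n


N = 1000  # assumed test set size
h1 = {"tp": 90, "fp": 10, "fn": 10}  # TP assumes 100 positives in the test set
h2 = {"tp": 90, "fp": 8, "fn": 10}

for name, h in (("h1", h1), ("h2", h2)):
    print(name,
          f"F1={f1(**h):.4f}",
          f"prevalence error={cc_prevalence_error(h['fp'], h['fn'], N):.4f}")
```

With these assumed numbers the second classifier obtains the higher $F_1$, while the first is the only one whose prevalence error is zero.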

The above example shows that even an accurate classifier may be *biased*, i.e., may keep its false positives to a minimum only at the expense of a substantially higher number of false negatives (or vice versa); if this is the case, the classifier is a bad quantifier. This phenomenon is not infrequent, especially in the presence of imbalanced data, i.e., data in which the items from the majority class by far outnumber the items from the other classes. This is very frequent, say, in text classification, where data relevant to a certain topic are often a tiny fraction of the entire set; but it occurs in all other contexts in which the amount of "signal" is much smaller than the amount of "noise". In such cases, learning algorithms that minimise "standard" loss functions (e.g., the Hamming loss, the hinge loss, or their proxies) often generate classifiers with a tendency to choose the majority class, which means a much higher number of false positives than false negatives for the majority class, which means in turn that such an algorithm will tend to underestimate the counts of minority classes. For instance, Esuli and Sebastiani (2015) report an experimentation on 5,148 binary test sets averaging 15,000+ examples each, in which a linear SVM delivers an average FN/FP ratio of 0.109 for the majority class; by contrast, for a perfect estimator of class prevalence values this ratio is 1.

The previous arguments indicate that *quantification should not be considered a mere by-product of classification, and should be studied and solved as a task of its own*. There are at least two other arguments that support this idea. One is that the functions that are used for evaluating classification cannot be used for evaluating quantification, since these functions measure, by and large, how many data items have been misclassified, and not how much the estimated class prevalence values differ from the true class prevalence values. This means that the learning algorithms that minimise these functions are optimised for classification, and not for quantification. (We will come back to this topic in Section 4.3.1.) A second, symmetrical argument, put forth by Forman (2008), is that methods specifically devised for learning to quantify require less training data in order to deliver the same quantification accuracy as standard methods based on "classify and count". While Forman's observation is of an empirical nature, there are also theoretical arguments that support this fact, which will be more thoroughly discussed in Section 4.4.

## **1.3 Notational Conventions**

Since in the next section we will start discussing quantification in some mathematical detail, we now fix some notation. By $\mathbf{x}$ we will indicate a data item drawn from a domain $X$, represented as a vector of features. By $y$ we will indicate a class drawn from a set of classes (or *codeframe*) $Y = \{y_1, \ldots, y_{|Y|}\}$, and by $\overline{y}$ we will indicate its complement, i.e., $\overline{y} = \bigcup_{y_i \in Y \setminus \{y\}} y_i$. When the codeframe contains just two classes we will often indicate this codeframe as $Y = \{\oplus, \ominus\}$, and will call $\oplus$ "the positive class" and $\ominus$ "the negative class". Given $\mathbf{x} \in X$ and $y \in Y$, a pair $(\mathbf{x}, y)$ will thus denote a data item with its class label; given a pair $(\mathbf{x}, y)$ we will also write *-(***x***)* = *y*, i.e., *-(***x***)* will indicate the label of **x**.<sup>1</sup> The symbol $\sigma$ will denote a *sample*, i.e., a non-empty set of (labelled or unlabelled) items drawn from $X$. Given a class $y_i$, we will denote by $\sigma_i$ the set of items in sample $\sigma$ that belong to $y_i$; we will denote by $|\sigma|$ the number of items contained in $\sigma$.

By $p_\sigma(y)$ we will indicate the true prevalence of class $y$ in sample $\sigma$, by $\hat{p}_\sigma(y)$ we will indicate an estimate of this prevalence<sup>2</sup>, and by $\hat{p}^M_\sigma(y)$ we will indicate the estimate of this prevalence as obtained via quantification method $M$. In other words, symbol $p$ will denote a true distribution of the unlabelled items across codeframe $Y$, while symbol $\hat{p}$ will denote a predicted distribution (or *estimator*), i.e., the result of estimating an unknown true distribution; symbol $P$ will denote the (infinite) set of all distributions on $Y$.<sup>3</sup> By $D(p, \hat{p})$ we will denote an evaluation measure for quantification.

A sample of labelled items (that we will typically use as a training set) will be denoted by *L*, while a sample of unlabelled items (that we will typically use as a sample to quantify on) will be denoted by *U*.

We will take a (*hard*) *classifier* to be a function $h : X \to Y$. By $\hat{p}^h_\sigma(y)$ we will denote the prevalence in sample $\sigma$ of the data items that have been assigned to class $y$ by classifier $h$. When dealing with binary contexts, we will use TP, FP, FN, TN to denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, as resulting from the application of a hard classifier to an unlabelled sample $U$, and as contained in the resulting binary contingency table.
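To make the binary notation concrete, the following sketch builds the contingency table from a list of true labels and a list of labels assigned by a hard classifier, and derives from it both the true prevalence of the positive class and the prevalence of the items assigned to it. The function name and the toy label lists are assumptions made for illustration.

```python
def contingency_table(true_labels, predicted_labels, positive):
    """Count TP, FP, FN, TN for the class given as `positive`."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("the two label lists must have the same length")
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            table["TP" if truth == positive else "FP"] += 1
        else:
            table["FN" if truth == positive else "TN"] += 1
    return table


true_labels = [1, 1, 0, 0, 1]        # hypothetical true labels
predicted_labels = [1, 0, 1, 0, 1]   # hypothetical output of a hard classifier
table = contingency_table(true_labels, predicted_labels, positive=1)
n = len(true_labels)

true_prevalence = (table["TP"] + table["FN"]) / n      # prevalence of true positives
assigned_prevalence = (table["TP"] + table["FP"]) / n  # prevalence of items assigned to the class
print(table, true_prevalence, assigned_prevalence)
```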

<sup>1</sup> For the time being we assume that a data item $\mathbf{x} \in X$ can belong to one and only one class $y \in Y$; the reason for this will be explained in Section 1.4.

<sup>2</sup> Consistently with most mathematical literature, we use the caret symbol (ˆ) to indicate estimation.

<sup>3</sup> In order to keep things simple we avoid overspecifying the notation, thus leaving some aspects of it implicit; e.g., in order to indicate a true distribution $p$ of the unlabelled items in a sample $\sigma$ across a codeframe $Y$ we will often write $p$ instead of the more cumbersome $p^Y_\sigma$, thus letting $\sigma$ and $Y$ be inferred from context.

**Table 1.1** Notation for the symbols most frequently used in this book.

We will instead take a *soft classifier* to be a function $s : X \to [0,1]^{|Y|}$ such that each $s(\mathbf{x})$ is a vector of $|Y|$ *posterior probabilities* (each indicated as $p(y|\mathbf{x})$) and such that $\sum_{y \in Y} p(y|\mathbf{x}) = 1$; $p(y|\mathbf{x})$ indicates the probability of membership in $y$ of item $\mathbf{x}$ as estimated by $s$.<sup>4</sup> A hard classifier is obtained from a soft classifier by taking

$$h(\mathbf{x}) = \arg\max_{y \in \mathcal{Y}} p(y|\mathbf{x}) \tag{1.2}$$

Table 1.1 summarises these symbols for convenience.
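Equation 1.2 can be sketched in a few lines. The function below takes the posterior probabilities that a soft classifier returns for one item and picks the class with the highest posterior; the name `harden`, the dictionary representation of the posteriors, and the example values are assumptions made for illustration, and ties are resolved in favour of the class listed first.

```python
def harden(posteriors, tol=1e-9):
    """Turn the output of a soft classifier for one item into a hard decision.

    `posteriors` maps each class y to p(y|x); the values must sum up to 1.
    """
    if abs(sum(posteriors.values()) - 1.0) > tol:
        raise ValueError("posterior probabilities must sum up to 1")
    # arg max over the classes, as in Equation 1.2
    return max(posteriors, key=posteriors.get)


soft_output = {"VeryNegative": 0.05, "Negative": 0.10, "Neutral": 0.15,
               "Positive": 0.45, "VeryPositive": 0.25}
hard_label = harden(soft_output)
print(hard_label)
```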

## **1.4 Quantification Problems**

Similarly to classification, learning to quantify admits different problems of applicative interest, based (a) on how many classes codeframe *Y* contains, and (b) how many of the classes in *Y* can be attributed at the same time to the same item. We characterise quantification problems as follows:

1. *Single-Label Quantification* (SLQ) is defined as quantification when each data item belongs to exactly one of the classes in *Y* = {*y*1*,...,y*|*Y*|}.

<sup>4</sup> Another way of saying this is that *<sup>s</sup>* is a function that maps the domain *<sup>X</sup>* onto the *probability simplex* (aka *standard simplex*) |*Y*| , defined as the unit (|*Y*| − 1)-simplex.

	- (a) as SLQ with $|Y| = 2$ (in this case $Y = \{y_1, y_2\}$ (or, as we will often write in the binary case, $Y = \{\oplus, \ominus\}$) and each item must belong to either $y_1$ or $y_2$), or
	- (b) as MLQ with $|Y| = 1$ (in this case $Y = \{y\}$ and each item either belongs or does not belong to $y$).

Among the above tasks, the one we will mostly devote our attention to in this book is SLQ. The reasons for doing this are the following:


<sup>5</sup> MLQ might in principle be solved in ways other than by recasting the problem into $|Y|$ independent binary quantification problems, i.e., it might be solved by attempting to leverage possible stochastic dependencies between the classes in $Y$, similarly to what is done in many approaches to multi-label classification. For MLQ, the only attempt we are aware of from past literature is by Levin and Roitman (2017). However, in this work the problem is tackled as a set of independent binary quantification problems, and the correlations among the classes are never brought to bear.

However, while SLQ will be the main focus of the book, the solutions that have been proposed in the literature for other quantification problems, such as OQ and RQ, will also be discussed.

## **1.5 Dataset Shift and Quantification**

Standard supervised learning algorithms are based on the assumption that the training data and the unlabelled data the predictor is supposed to issue predictions about (which in experimental settings is represented by the test set) are *independently and identically distributed* (IID). In other words, since labelled data items are represented by pairs of type $(\mathbf{x}, y)$, the distribution of pairs in the labelled set is assumed to be the same as that on the set of unlabelled items, i.e., $p_L(\mathbf{x}, y) = p_U(\mathbf{x}, y)$. Of particular interest to quantification is the fact that, *a fortiori*, the distribution of labels is assumed to stay constant, i.e., $p_L(y) = p_U(y)$.

But the world we live in and the data it provides are constantly evolving, and the scenarios in which we might want to deploy the trained models may widely differ. For instance, in an effort to use quantification technology for estimating the prevalence of different species of living beings on the seabed (see Figure 1.2), Beijbom et al. (2015) train a model on labelled data mostly collected on the coasts off the Bahamas and Caicos islands, and apply the trained model on unlabelled data collected in various other locations, including the coasts off Mexico and Venezuela. In this case, and in many other cases, the IID assumption is violated, and $p_L(\mathbf{x}, y) \neq p_U(\mathbf{x}, y)$; this phenomenon is usually referred to as *dataset shift* (Moreno-Torres et al., 2012; Quiñonero-Candela et al., 2009).<sup>6</sup> Of interest to us is the particular case in which $p_L(y) \neq p_U(y)$, which is often called *distribution shift* (Bella et al., 2014).

**Fig. 1.2** Using quantification for estimating the prevalence of different species of living beings on the seabed; red circles indicate the locations where the training data were collected while blue circles indicate the locations where the unlabelled data to which the trained model was applied were collected (from Beijbom et al., 2015).

Example reasons why distribution shift may occur are the following:


<sup>6</sup> The word "drift" is also often used in place of "shift" in the machine learning literature; this applies not only to the term "dataset shift" but also to the various types of shift we will discuss in this section.

**Fig. 1.3** Example of distribution shift in the RCV1-v2 test collection.

general, when using active learning in order to build the training set, dataset shift will be present regardless of the active learning technique used (i.e., relevance sampling or other), for the simple fact that all active learning techniques force the assessor to annotate items in a non-random fashion, and this divergence from randomness inherently means dataset shift.

Bullets 2 and 3 are both examples of *sample selection bias*, a term that refers to the presence of a systematic bias (sometimes intended, sometimes unintended) either in the process of data collection or in the process of data labelling, and to the fact that due to this bias the distribution of training examples ends up being different from the distribution of data in the domain to be modelled.

Figure 1.3 illustrates an example of distribution shift in the well-known RCV1-v2 test collection (Lewis et al., 2004). This collection consists of one year's worth of timestamped news published by Reuters from Aug 20, 1996, to Aug 19, 1997. The blue curve in Figure 1.3 is the result of binning these 804,414 news stories into 52 bins, one per week, and computing the prevalence in each bin of one of the 101 classes (class E21) of which the codeframe consists. The *x* axis indicates the week for which the prevalence value is computed, while the *y* axis indicates the corresponding prevalence value. The blue curve represents the true prevalence values, while the other three curves represent the prevalence values as estimated by three of the quantification methods that we will discuss in Section 4 (the ACC, PACC, and SVM(KLD) methods, discussed in Sections 4.2.3, 4.2.4, and 4.3.1, respectively). The fact that there is distribution shift in this dataset is shown by the fact that the blue curve is not a flat line; a good quantifier is one that generates a curve as close to the blue curve as possible. Note that, while there is indeed some dataset shift here, its magnitude is not high, as shown by the fact that the oscillations of the blue curve around the *y* = 0.05 line are moderate. Other applicative scenarios exhibit a much more marked distribution shift.
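The binning procedure behind a curve of true prevalence values like the blue one can be sketched as follows. This is only an illustration of the idea, not the code used to produce Figure 1.3: the function name, the representation of a story as a (date, set of classes) pair, and the three toy stories are assumptions, as is the choice of placing any day beyond the last full week into the last bin.

```python
from datetime import date


def weekly_prevalence(stories, start, n_weeks, target_class):
    """Prevalence of `target_class` in each 7-day bin counted from `start`.

    `stories` is an iterable of (publication date, set of class labels) pairs.
    Bins that contain no stories yield None.
    """
    totals = [0] * n_weeks
    hits = [0] * n_weeks
    for day, labels in stories:
        offset = (day - start).days
        if offset < 0:
            raise ValueError("story dated before the start of the collection")
        week = min(offset // 7, n_weeks - 1)  # assumption: overflow goes to the last bin
        totals[week] += 1
        if target_class in labels:
            hits[week] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical miniature collection spanning two weeks.
toy_stories = [
    (date(1996, 8, 20), {"E21"}),
    (date(1996, 8, 22), {"C15"}),
    (date(1996, 8, 27), {"E21", "C15"}),
]
weekly = weekly_prevalence(toy_stories, date(1996, 8, 20), 2, "E21")
print(weekly)
```

A sequence of values that is not constant across bins is what the text describes as a curve that "is not a flat line".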

Note that the presence of dataset shift, and of distribution shift in particular, is the *raison d'être* of applications that track class prevalence across different contexts (i.e., across time, space, or other variables), i.e., of studying quantification. If we could assume that there is no dataset shift, i.e., that $p_L(\mathbf{x}, y)$ always equals $p_U(\mathbf{x}, y)$, the optimal quantification strategy would be to assume that, for each $y \in Y$, $p_L(y) = p_U(y)$ for all unlabelled samples $U$. (This trivial strategy, that we call *Maximum Likelihood Prevalence Estimation* (MLPE), will be discussed in Section 4.1.) In other words, the reason for studying and solving quantification lies in the awareness that dataset shift, and distribution shift in particular, exists.
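The trivial strategy just described can be sketched as follows: the class prevalence values observed in the training set are returned as the estimate for any unlabelled sample, whose content is never inspected. This is only our reading of the description given here (the method itself is discussed in Section 4.1); the function name `train_mlpe` and the toy data are assumptions.

```python
from collections import Counter


def train_mlpe(training_labels, classes):
    """Return a quantifier that always predicts the training prevalence values."""
    if not training_labels:
        raise ValueError("the training set must be non-empty")
    counts = Counter(training_labels)
    training_prevalence = {y: counts[y] / len(training_labels) for y in classes}

    def quantify(unlabelled_sample):
        # The unlabelled sample is deliberately ignored: the strategy assumes
        # that the label distribution is the same as in the training set.
        return dict(training_prevalence)

    return quantify


quantify = train_mlpe(["pos", "neg", "neg", "neg"], ["pos", "neg"])
print(quantify(["any", "unlabelled", "items"]))
```

Since the estimate never changes, any difference between the training prevalence and the true prevalence in the unlabelled sample translates directly into quantification error, which is why this strategy is only adequate when no shift is present.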

# *1.5.1 Types of Dataset Shift and Their Relation to Quantification*

In order to assess the impact of distribution shift on quantification, it is useful to note that *p(y)* may be written as

$$p(y) = \sum_{\mathbf{x}} p(y|\mathbf{x})\, p(\mathbf{x}) \tag{1.3}$$

When any of *p(y*|**x***)* and *p(***x***)* vary in switching from the training data to the unlabelled data, distribution shift occurs. The case in which *p(***x***)* varies occurs when certain regions of the feature space are more densely populated in *U* than in *L* while other regions are correspondingly less densely populated in *U* than in *L*; this phenomenon is usually called *covariate shift*. For instance, the example about class Terrorism in Bullet 1 above is a case of covariate shift, as is the example in Bullet 2. Instead, the case in which *p(y*|**x***)* varies occurs when the meaning of class *y* has changed (where "meaning" is to be understood in the sense of extensional semantics), and the very same item **x** that had label *y* in *L* might not have label *y* in *U*; this phenomenon is usually called *concept shift*. For instance, the example about news falling in the HomeNews or Europe classes in Bullet 1 above is a case of concept shift.
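The effect of the two kinds of shift on Equation 1.3 can be checked numerically. The sketch below assumes a single discrete feature with three values and a binary label; all the numbers, and the helper name `class_prior`, are made up for illustration.

```python
def class_prior(p_y_given_x, p_x):
    """Equation 1.3 for a binary label: p(y) = sum over x of p(y|x) * p(x)."""
    return sum(p_y_given_x[x] * p_x[x] for x in p_x)


# Training distribution (illustrative numbers).
p_x_train = {"a": 0.5, "b": 0.3, "c": 0.2}
p_y_given_x_train = {"a": 0.1, "b": 0.5, "c": 0.9}

# Covariate shift: p(x) changes while p(y|x) stays the same.
p_x_covariate = {"a": 0.2, "b": 0.3, "c": 0.5}

# Concept shift: p(y|x) changes while p(x) stays the same.
p_y_given_x_concept = {"a": 0.3, "b": 0.5, "c": 0.9}

prior_train = class_prior(p_y_given_x_train, p_x_train)
prior_covariate = class_prior(p_y_given_x_train, p_x_covariate)
prior_concept = class_prior(p_y_given_x_concept, p_x_train)
print(prior_train, prior_covariate, prior_concept)
```

With these numbers the class prevalence is 0.38 on the training distribution, 0.62 under the covariate shift and 0.48 under the concept shift (up to floating-point rounding), so both kinds of shift end up changing *p(y)*.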

Figure 1.4 (taken from Bella et al., 2014) exemplifies covariate shift, concept shift, and the distribution shift that derives from them, in graphical form. The plots are the result of an experiment for a regression task, where labels take values not on a discrete codeframe but on the set of real numbers (here: on the [0,1] interval), and where we assume the existence of a single feature *x*. The top left sub-figure shows the distribution of the examples in the training set. The top right sub-figure shows the distribution of the examples in a test set which exhibits neither covariate shift nor concept shift (i.e., the training set and the test set are IID). The bottom left sub-figure shows the distribution of the examples in a test set which exhibits no concept shift (since *p(y*|**x***)* is the same as in the training set) but reveals the presence of covariate shift (since *p(***x***)* is not the same as in the training set), which in turn generates distribution shift (*p(y)* not being the same as in the training set). The bottom right sub-figure shows the distribution of the examples in a test set which exhibits no covariate shift (since *p(***x***)* is the same as in the training set) but reveals the presence of concept shift (since *p(y*|**x***)* is not the same as in the training set), which also causes distribution shift to happen (*p(y)* being different in the training set and in the test set).

**Fig. 1.4** Distribution shift and concept shift in regression; the image is from Bella et al. (2014), where "concept shift" is called (as often happens in machine learning literature) "concept drift".

Covariate shift, concept shift, distribution shift, and Equation 1.3 are relevant in what Fawcett and Flach (2005) have called *X* → *Y problems*, i.e., problems in which it is the values of the features in **x** that probabilistically determine the label *y* of **x**. An example of an *X* → *Y* learning problem is weather forecasting, since it is a number of climatic conditions (e.g., pressure, temperature, humidity, etc., that can be represented in a feature vector **x**) that determine whether it is going to snow or not (a fact that can be represented by a binary dependent variable *y*), and not the other way around. In these cases, if the distribution of climatic conditions shifts, the probability that it is going to snow shifts too.

It is also useful to note that *p(***x***)* may be written as

$$p(\mathbf{x}) = \sum_{y} p(\mathbf{x}|y)\, p(y) \tag{1.4}$$

This equation is instead relevant in what Fawcett and Flach (2005) have called *Y* → *X problems*, i.e., problems in which the class to which a data item **x** belongs probabilistically determines the values of the features in vector **x**. An example of a *Y* → *X* learning problem is authorship attribution, i.e., the task of inferring the author (from a set of |*Y*| candidate authors) of a text of unknown or disputed paternity (Koppel et al., 2009). Authorship attribution, a task which is usually carried out by using as features a number of "stylistic" traits that tend to characterise an author's writing style, is a *Y* → *X* problem, since it is the fact that a certain text is, say, Shakespeare's, that causes it to have certain stylistic characteristics, and not the other way around. In *Y* → *X* problems, when *p(y)* varies across *L* and *U*, it does so "autonomously" (since *y* is a cause, and not an effect); this phenomenon is usually called *prior probability shift* (Storkey, 2009), or sometimes *label shift* (Alexandari et al., 2020).
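Equation 1.4 can be illustrated in the same spirit. The sketch below keeps the class-conditional distributions *p(***x***|y)* fixed and changes only *p(y)*, which is the situation described above as prior probability shift; the two classes, the single binary "stylistic" feature, and all the numbers are hypothetical.

```python
def feature_marginal(p_x_given_y, p_y):
    """Equation 1.4: p(x) = sum over y of p(x|y) * p(y)."""
    feature_values = next(iter(p_x_given_y.values())).keys()
    return {x: sum(p_x_given_y[y][x] * p_y[y] for y in p_y)
            for x in feature_values}


# Class-conditional feature distributions, assumed identical in L and U.
p_x_given_y = {
    "shakespeare": {"archaic": 0.8, "modern": 0.2},
    "other": {"archaic": 0.3, "modern": 0.7},
}

p_y_train = {"shakespeare": 0.5, "other": 0.5}   # prevalence in L
p_y_unlab = {"shakespeare": 0.1, "other": 0.9}   # prevalence in U

p_x_train = feature_marginal(p_x_given_y, p_y_train)
p_x_unlab = feature_marginal(p_x_given_y, p_y_unlab)
print(p_x_train, p_x_unlab)
```

Here the change in class prevalence alone moves the probability of the "archaic" feature value from 0.55 to 0.35 (up to floating-point rounding), even though the way each author writes has not changed.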

In the context of text classification, Card and Smith (2018) call the class labels attached to data items in *X* → *Y* problems *extrinsic labels*, while they call the ones in *Y* → *X* problems *intrinsic labels*. The rationale of these names is that in *Y* → *X* problems the labels are intrinsic properties of the data item, and precede the generation of the data item itself, while this is not the case in *X* → *Y* problems. In other words, in *X* → *Y* problems, whether the label of a data item **x** is *y* or not is open to subjective interpretation, while it is not in *Y* → *X* problems.

However, it should be noted that it is not always easy to characterise with certainty a given problem as being of type *X* → *Y* or of type *Y* → *X*; sometimes this question looks a bit akin to wondering which of chicken and egg came first. As a result, different types of shift (covariate shift, concept shift, prior probability shift) that concur in causing distribution shift may be at play at the same time.

In realistic settings, distribution shift is bound to happen at some scale. Its magnitude might just be negligible, in which case the performance of a classifier at deployment will be nearly unaffected, and *pU (y)* will be well approximated by *pL(y)*. However, in the absence of guarantees stemming from domain expertise, a cautious approach will include a procedure to monitor distribution shift and, ideally, a mechanism to adapt to it.<sup>7</sup>

<sup>7</sup> An example test for checking if the class prevalence values have significantly changed from one labelled set to another is the one discussed in Saerens et al. (2002, §3).

# **1.6 Quantification and Bias Mitigation**

Quantification is inherently connected to the notion of *bias*, and to attempts at mitigating it. This is best explained by looking at the behaviour of the Classify and Count method in action. To this aim, let us consider IMDb, a dataset of movie reviews often used for evaluating binary sentiment quantification systems. The dataset consists of 50,000 documents and is perfectly balanced, i.e., there is an equal number of Positive and Negative reviews. Let us split the dataset in two equally sized, perfectly balanced portions, one used for training purposes and another used for testing purposes. Let us use the training portion (containing 25,000 documents) to generate 9 random training samples of 5,000 documents each, at controlled class prevalence values. Specifically, let us sample random training sets *L*10%, *L*20%, *...*, *L*90% with a prevalence value for the class Positive of 10%, 20%, *...*, 90%, respectively. Let us use each training set thus generated to train a classifier (in this case we use an SVM with a linear kernel), and let us use each such classifier to implement a basic Classify and Count approach, thus generating a series of quantifiers that we denote by CC10%, CC20%, *...*, CC90%. Let us do something similar in the test portion in order to generate test samples characterised by widely varying class prevalence values. In particular, let us use a finer-grained grid of prevalence values in order to generate test sets *U*0%, *U*5%, *U*10%, *...*, *U*100% of 500 documents each, and let us repeat this process 100 times in order to obtain more reliable results. Finally, let us use all of our CC quantifiers to generate predictions for all the test samples. (Experimental protocols like the one we have described here are rather common in the quantification literature, and will be the subject of Section 3.4.)

The results of this experiment are reported in Figure 1.5. These plots represent the estimated prevalence values along the *y*-axis and the true prevalence values along the *x*-axis; we show results averaged across the 100 repetitions, with colour bands representing standard deviation. Since IMDb is binary, we only report results for the Positive class. Such a plot is typically called a "diagonal plot", and will be more thoroughly discussed in Section 6.3.2.

**Fig. 1.5** Diagonal plot showing how CC delivers biased estimators of class prevalence values.
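The sampling step of the protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the pools contain stand-in `(id, label)` pairs rather than IMDb reviews, the helper name `sample_at_prevalence` is hypothetical, and a single repetition is drawn instead of 100.

```python
import random


def sample_at_prevalence(positives, negatives, prevalence, size, rng):
    """Draw `size` items without replacement so that the fraction of
    positive items equals `prevalence` (up to rounding)."""
    n_pos = round(prevalence * size)
    n_neg = size - n_pos
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample


rng = random.Random(0)

# Stand-in pools of (id, label) pairs: 12,500 positives and 12,500
# negatives for each of the two balanced portions of 25,000 items.
train_pos = [(i, 1) for i in range(12500)]
train_neg = [(i, 0) for i in range(12500, 25000)]
test_pos = [(i, 1) for i in range(25000, 37500)]
test_neg = [(i, 0) for i in range(37500, 50000)]

# Training samples L_10%, ..., L_90% of 5,000 items each.
train_sets = {p: sample_at_prevalence(train_pos, train_neg, p / 100, 5000, rng)
              for p in range(10, 100, 10)}

# Test samples U_0%, U_5%, ..., U_100% of 500 items each (one repetition).
test_sets = {p: sample_at_prevalence(test_pos, test_neg, p / 100, 500, rng)
             for p in range(0, 101, 5)}
```

Repeating the construction of `test_sets` with different seeds would give the repetitions over which the results are averaged.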

The most important fact that emerges from this figure is that *Classify and Count generates biased estimators of class prevalence*, and is thus (as already anticipated in Section 1.2) a suboptimal quantification method: its prevalence estimates *p̂Uα(y)* for a class *y* are always intermediate between the true prevalence *α* = *pUα(y)* in the unlabelled set *Uα* and the prevalence *β* = *pLβ(y)* in the labelled set *Lβ* on which the classifier was trained (where *α, β* ∈ [0%, 100%]), and are very often much closer to the latter than to the former. In other words, the factor that biases the class prevalence estimators is the class prevalence of the training set: in general, given sets of data *L* and *U*, Classify and Count does not seem to be able to predict class prevalence values for *y* much different from *pL(y)*, even if the true class prevalence value *pU (y)* is far away from this value. This trend is by no means specific to this dataset, and naturally arises in many different applicative contexts.

This should not surprise us, since standard learning mechanisms assume that the training set *L* and the unlabelled set *U* are IID, i.e., that *pL(***x***,y)* = *pU (***x***,y)*; as a result, the predictor learns from *L* not only the correlation between features and labels (i.e., *p(***x***,y)*), but also the prevalence values of the labels (i.e., *p(y)*). An additional fact that emerges from Figure 1.5 is that the more the training set is imbalanced, the stronger this effect is; in fact, this effect is strongest in the extreme cases concerning the L10% and L90% datasets, while it is weakest in the perfectly balanced L50% dataset.
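The pull of the training prevalence on Classify and Count can be reproduced in a small synthetic simulation. The sketch below is not the IMDb experiment: it assumes one real-valued feature drawn from unit-variance Gaussians centred at −1 (Negative) and +1 (Positive), and it replaces the linear SVM with an idealised classifier that applies Bayes' rule using the training prevalence as its prior. All function names are illustrative.

```python
import math
import random


def train_threshold(train_prevalence):
    """Decision threshold of the idealised classifier: with Gaussians
    N(-1, 1) and N(+1, 1) and prior `train_prevalence` for Positive,
    Bayes' rule predicts Positive if and only if x > threshold."""
    return 0.5 * math.log((1 - train_prevalence) / train_prevalence)


def draw_sample(prevalence, size, rng):
    """Unlabelled sample with the given true prevalence of Positive."""
    n_pos = round(prevalence * size)
    return ([rng.gauss(+1, 1) for _ in range(n_pos)] +
            [rng.gauss(-1, 1) for _ in range(size - n_pos)])


def classify_and_count(sample, threshold):
    """CC: classify every item, return the fraction predicted Positive."""
    return sum(x > threshold for x in sample) / len(sample)


rng = random.Random(0)
results = {}
for beta in (0.1, 0.5, 0.9):        # training prevalence
    threshold = train_threshold(beta)
    for alpha in (0.1, 0.5, 0.9):   # true prevalence in U
        U = draw_sample(alpha, 20000, rng)
        results[(beta, alpha)] = classify_and_count(U, threshold)

for (beta, alpha), estimate in sorted(results.items()):
    print(f"train={beta:.1f} true={alpha:.1f} CC estimate={estimate:.3f}")
```

In this synthetic setting a classifier built for a balanced training set returns, for a sample whose true prevalence is 0.9, an estimate of roughly 0.77 in expectation, and the estimate drops further when the training prevalence is 0.1. The simulation is idealised, so it should be read as an illustration of the mechanism rather than as a replication of Figure 1.5.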

As much recent research on the fairness and accountability of machine learning methods shows (Mehrabi et al., 2019), sample selection bias may be a serious problem, in that it may propagate stereotypes and lead to incorrect decision-making. As an example, suppose our aim is to estimate the prevalence of class AfricanAmerican in an unlabelled set representing patients not covered by insurance (Elliott et al., 2009). If the prevalence of this class is .10 in the training set *L*, the estimate *p̂*(AfricanAmerican) may be close to .10 even if the true prevalence of AfricanAmerican in *U* is, say, .50. This may lead to underestimating racial disparities in healthcare, misguided public health decisions, and diversion of precious resources.

The goal of "genuine" quantification methods (i.e., methods different from Classify and Count) is thus to eliminate, or at least mitigate, this bias. Figure 1.6 plots the results of CC along with those of two other methods for learning to quantify (in this case, all methods are trained on a perfectly balanced subset of 5,000 documents); the fact that the curves corresponding to these other methods are closer to the diagonal line than the Classify and Count curve shows that these other methods succeed, in varying degrees, in mitigating this bias. More on this in the sections to follow.

**Fig. 1.6** Diagonal plot showing quantification methods that succeed in mitigating bias for the IMDb dataset.

## **1.7 Structure of This Book**

In the above sections we have motivated why quantification is an interesting problem, why it should be addressed as a task of its own instead of as a byproduct of classification, how it is rooted in the fundamental problem of dataset shift, and how one of its goals is to mitigate the bias of which Classify and Count suffers.

The rest of this book is structured as follows.

Section 2 examines the applications of quantification. Special emphasis is given to the fields of human inquiry which are devoted to studying phenomena only at a collective level, such as market research, political science, the social sciences, ecological modelling, and epidemiology. However, we also pay attention to quantification as a means to improve, in scenarios characterised by distribution shift, the accuracy of *classification*, and this may in turn have an impact on many diverse fields and applications.

In Section 3 we turn our attention to the issue of how to experimentally evaluate quantification algorithms. A large part of this section is devoted, as should be expected, to discussing the various measures that have been proposed over the years for evaluating quantifiers. However, we also pay attention to the different experimental *protocols* that have been used in different works for carrying out the evaluation, protocols that differ essentially in terms of the stand they take towards relying or not on artificially generated samples.

Section 4 is devoted to presenting supervised learning methods for performing quantification, starting with "aggregative" methods (i.e., methods that involve the classification of individual items as a preliminary step) and ending with "nonaggregative" ones (i.e., methods that analyse the sample "holistically", without issuing individual classification decisions). In the course of this discussion, attention is paid both to methods that rely on "general-purpose" learners (i.e., ones that had originally been designed for tasks other than quantification) and to methods that are based on "special-purpose" learners (i.e., learners designed with quantification in mind).

In Section 5 we look at some advanced, "niche" topics, including quantification for ordinal codeframes, regression quantification, text quantification in cross-lingual settings, quantification for networked data, quantification for data streams, and others.

Section 6 takes a look back at the historical development of quantification as a task, and at how (like many other tasks) it has witnessed independent contributions from researchers coming from different areas (machine learning, data mining, statistics, information retrieval), sometimes unaware of the developments that had gone on in other areas. This section also describes publicly available software packages and offers a brief tour of experimental results and of visualisation tools to present them. Finally, we look at related tasks, spelling out the differences between them and learning to quantify. We conclude in Section 7, hinting at open problems and possible areas of further investigation.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Applications of Quantification**

Broadly speaking, there are two reasons why one might want to perform supervised prevalence estimation:


# **2.1 Improving Classification Accuracy**

The presence of dataset shift can damage the accuracy of a machine-learned classifier, because essentially all classifier training algorithms are based on the IID assumption (i.e., perform at their best when the training set *L* and the set *U* of unlabelled items are IID), an assumption which dataset shift invalidates. A particularly illuminating example of why distribution shift, of all the types of dataset shift that can occur, can make a classifier perform sub-optimally, is the Bayes optimal classifier, which is given by

$$\begin{aligned} h(\mathbf{x}) &= \arg\max_{y} p(y|\mathbf{x}) \\ &= \arg\max_{y} \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})} \end{aligned} \tag{2.1}$$

© The Author(s) 2023 A. Esuli et al., *Learning to Quantify*, The Information Retrieval Series 47, https://doi.org/10.1007/978-3-031-20467-8\_2

This equation shows that the posterior probabilities *p(y*|**x***)* generated by the classifier (and, in turn, the classification decision arg max<sub>*y*</sub> *p(y*|**x***)*) depend on the class prevalence values *p(y)*, which are estimated on *L*. In the presence of distribution shift between *L* and *U* this estimation will be inaccurate, and the quality of the posterior probabilities (and of the final decision) will be negatively influenced. For instance, if *pU (y) > pL(y)*, then *p(y*|**x***)* as deriving from Equation 2.1 will be smaller than it should be, and *y* will have a lower-than-ideal chance to be picked as the label for **x**.

In order to improve the quality of both the posterior probabilities and the classification decisions generated by the classifier, we would need to use, in Equation 2.1, the value *pU (y)* in place of the value *pL(y)* that is normally used. Since *pU (y)* is unknown, one possibility is to use quantification methods to estimate it.

More precisely, what the use of quantification methods allows us to do is to improve the *calibration* of the posterior probabilities. An intuition of what "calibrated probabilities" means is given by the following example: if only 10% of all the data items **x** for which *p(y*|**x***)* = .5 indeed belong to *y*, we can say that the classifier has overestimated the probability that these items belong to *y*, and that their posteriors are thus inaccurate; if this percentage is instead 90%, we can say that the classifier has underestimated this probability, again resulting in inaccurate posteriors. Indeed, we say (see e.g., Flach, 2017) that the posteriors *p(y*|**x***)*, where the data items **x** belong to a sample *σ*, are (perfectly) *calibrated* (i.e., accurate) when, for all *a* ∈ [0, 1], it holds that<sup>1</sup>

$$\frac{|\{\mathbf{x} \in \sigma \mid p(y|\mathbf{x}) = a,\ \Phi(\mathbf{x}) = y\}|}{|\{\mathbf{x} \in \sigma \mid p(y|\mathbf{x}) = a\}|} = a \tag{2.2}$$
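Equation 2.2 can be checked empirically on a labelled sample. The sketch below is an approximation under stated assumptions: since real-valued posteriors rarely coincide exactly, items are grouped into equal-width bins of posterior values (the binning, the number of bins, and the function names are choices made here, not part of the definition), and the mismatch between mean posterior and observed fraction of positives is averaged into a single binned calibration error.

```python
from collections import defaultdict


def calibration_table(posteriors, labels, n_bins=10):
    """For each non-empty bin of posterior values return a tuple
    (mean posterior, observed fraction of positives, number of items)."""
    bins = defaultdict(list)
    for p, y in zip(posteriors, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, y))
    table = {}
    for b, items in sorted(bins.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        frac_pos = sum(y for _, y in items) / len(items)
        table[b] = (mean_p, frac_pos, len(items))
    return table


def binned_calibration_error(posteriors, labels, n_bins=10):
    """Average of |mean posterior - fraction of positives| over the bins,
    weighted by the number of items in each bin."""
    table = calibration_table(posteriors, labels, n_bins)
    n = len(posteriors)
    return sum(cnt * abs(mean_p - frac)
               for mean_p, frac, cnt in table.values()) / n


# Ten items with posterior .5, of which only one truly belongs to y.
posteriors = [0.5] * 10
overestimated = binned_calibration_error(posteriors, [1] + [0] * 9)
calibrated = binned_calibration_error(posteriors, [1] * 5 + [0] * 5)
print(overestimated, calibrated)
```

The first toy sample mirrors the overestimation example given above (10% of the items with posterior .5 belong to *y*) and yields an error of about 0.4, while the second satisfies Equation 2.2 for *a* = .5 and yields zero.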

Even assuming that our learner generates classifiers that tend to return well-calibrated probabilities, the classifier is calibrated for the training set *L*, which means that, in the presence of distribution shift, it cannot be calibrated for *U* too. The posteriors *p(y*|**x***)* can be re-calibrated (i.e., tuned on the unlabelled data) by multiplying them by *pU (y)/pL(y)*, but in order to do this, *pU (y)* needs to be estimated, which is where quantification comes into play. Well-calibrated probabilities are important in a number of tasks, including (aside from standard classification, as argued earlier in this section) (a) cost-sensitive classification (Elkan, 2001), (b) risk assessment and minimisation (as in credit scoring (Hand and Henley, 1997) or in technology-assisted review (Oard et al., 2018)), and (c) ranking classes in terms of their suitability to a data item (Makris et al., 2007).
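The re-calibration step described above can be sketched in a few lines. The sketch assumes that the posteriors of one data item are given as a dictionary from classes to probabilities and that an estimate of *pU (y)* has already been obtained from some quantifier; the final renormalisation, which makes the rescaled values sum to one, is an assumption added here, as are the function name and the toy numbers.

```python
def recalibrate(posteriors, train_prev, target_prev):
    """Rescale the posteriors p(y|x) of one item by p_U(y) / p_L(y).

    posteriors  -- dict: class -> posterior from a classifier trained on L
    train_prev  -- dict: class -> prevalence p_L(y) in the training set
    target_prev -- dict: class -> estimate of p_U(y), e.g., from a quantifier
    """
    scaled = {y: posteriors[y] * target_prev[y] / train_prev[y]
              for y in posteriors}
    total = sum(scaled.values())
    # Renormalise so that the result is again a probability distribution.
    return {y: value / total for y, value in scaled.items()}


# Toy example: class "neg" is rare in L but frequent in U.
posteriors = {"pos": 0.7, "neg": 0.3}
train_prev = {"pos": 0.9, "neg": 0.1}
target_prev = {"pos": 0.2, "neg": 0.8}   # assumed quantifier output

adjusted = recalibrate(posteriors, train_prev, target_prev)
print(adjusted)
```

In this toy case the item, which the original classifier would assign to "pos", is assigned to "neg" once the prevalence estimated for *U* is taken into account.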

Most works that use quantification in order to improve classification accuracy do so, as explained above, by trying to improve the quality of the posterior probabilities; this is the route that Alaíz-Rodríguez et al. (2011), Saerens et al. (2002), Vucetic and Obradovic (2001), Xue and Weiss (2009), and Zhang and Zhou (2010) follow. A different line of research is that of Balikas et al. (2015), who use quantification for optimising the parameters of the classifier in semi-supervised classification contexts in which there are not enough labelled validation data to optimise the parameters on.

<sup>1</sup> Perfect calibration is usually unattainable on any non-trivial dataset; however, calibration comes in degrees (and the quality of calibration can indeed be measured, via functions such as *calibration error*), so efforts can be made to obtain posteriors that are as close as possible to their perfectly calibrated counterparts.

## *2.1.1 Word Sense Disambiguation*

Chan and Ng (2005, 2006) show that *Word Sense Disambiguation* (WSD) is a particularly interesting application context in which one might want, as discussed above, to improve the quality of the posterior probabilities with the goal of improving classification accuracy. WSD is the task of predicting, given a natural language sentence in which an ambiguous word occurs, which of the senses that this word has is the intended one. For each word it is assumed that there are a finite number of senses and that these senses are known in advance; as a result, this is a classification task, where the occurrence of the word is the item to classify and the senses of the word are the classes. Accordingly, word sense disambiguators are usually classifiers trained on corpora of sense-tagged texts.

However, these classifiers are often influenced by the sense priors of the corpora they have been trained on. For instance, assume that the word to disambiguate is bank, that one of its senses is that of a financial institution (as in the bank round the corner) and another of its senses is that of a hydraulic artefact (as in the banks of river Thames). Assume that such a classifier has been trained on a corpus *L* of financial texts; in this case the prevalence of the former sense will be much larger than that of the latter sense. Assume also that the trained classifier is used to disambiguate a set *U* of texts about hydraulic engineering; in this case many occurrences of word bank will have the latter sense but will be wrongly attributed to the former one, since the classifier is biased towards the financial sense of the word. One might thus want to recalibrate on the unlabelled set *U* the posterior probabilities of the different word senses, and the way Chan and Ng (2005, 2006) do so is by using quantification in the way discussed above.

Note that this is just an instance of the general process of adapting a classifier trained on a "source" domain to a different, "target" domain, a process known as *transfer learning* (Vilalta et al., 2011) which has countless applications. Since no part of the process described above is specific to word sense disambiguation, this suggests that quantification may play an important role in several other contexts in which transfer learning is used.

## **2.2 Fairness**

## *2.2.1 Improving Fairness*

Quantification can be used to improve not only the accuracy of a classifier *h* but also its *fairness*, i.e., its ability to avoid propagating prejudice, inequity, and partisanship. Biswas and Mukherjee (2021) use quantification in order to make sure that a classifier *h* does not promote discrimination with respect to a sensitive attribute, such as race or gender, and do so by introducing the notion of *Proportional Equality* (PEq). Suppose we mark a given attribute *s* as "sensitive" or "protected", i.e., we want to impose that it should not be a basis for discrimination. For the sake of exposition, let us consider binary sex as sensitive, with class values *c* ∈ *S* = {♂, ♀}. It might well be that our training set *L* is sex-biased, i.e., for a certain class *y* it happens that *pL(y*|♂*)* (the prevalence of *y* in the set of male individuals belonging to *L*) is substantially different from its female counterpart *pL(y*|♀*)*; for instance, if *y* corresponds to the class of Engineers, it might happen that *pL(y*|♂*)* ≫ *pL(y*|♀*)*. It might also happen that our set *U* of unlabelled items does not have this bias, i.e., it does not hold that *pU (y*|♂*)* ≫ *pU (y*|♀*)*. In this case, we would not want the bias in *L* to influence the way the data in *U* are classified.<sup>2</sup> This can be achieved by imposing proportional equality, i.e., imposing that

$$\text{PEq} = \left| \frac{p_U^h(\hat{y}|\text{♂})}{p_U^h(\hat{y}|\text{♀})} - \frac{p_U(y|\text{♂})}{p_U(y|\text{♀})} \right| \le \epsilon \tag{2.3}$$

where *p*<sup>*h*</sup><sub>*U*</sub>(*ŷ*|*c*) represents (see Table 1.1) the fraction of the members of sample *U* belonging to *c* to which classifier *h* has assigned class *y*. In other words, Equation 2.3 prescribes that the way the labels assigned by the classifier are distributed in *U* is "fair", i.e., reflects the way they are actually distributed in *U*. Of course, this latter distribution is unknown; the idea is thus to estimate it via quantification methods, and plug the resulting estimate of PEq into an optimisation procedure aimed at minimising it.
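The quantity in Equation 2.3 can be computed as in the following sketch, which assumes that the labels assigned by the classifier and the values of the sensitive attribute are available as parallel lists, and that estimates of the true groupwise prevalence values have already been produced by a quantifier. The group codes `"M"` and `"F"`, the function names, and the toy data are hypothetical; the optimisation procedure that minimises PEq is not shown.

```python
def group_prevalence(labels, groups, group, target):
    """Fraction of the members of `group` whose label equals `target`."""
    members = [lab for lab, g in zip(labels, groups) if g == group]
    return sum(lab == target for lab in members) / len(members)


def proportional_equality(pred_labels, groups, est_prev_male,
                          est_prev_female, target):
    """Left-hand side of Equation 2.3.

    est_prev_male, est_prev_female -- estimates of p_U(y|male) and
    p_U(y|female), assumed to come from a quantification method.
    Both denominators are assumed to be non-zero.
    """
    ratio_pred = (group_prevalence(pred_labels, groups, "M", target) /
                  group_prevalence(pred_labels, groups, "F", target))
    ratio_true = est_prev_male / est_prev_female
    return abs(ratio_pred - ratio_true)


# Toy sample U: 10 males and 10 females.
groups = ["M"] * 10 + ["F"] * 10
# The classifier labels 6 males and 2 females as Engineer.
pred = (["Engineer"] * 6 + ["Other"] * 4 +
        ["Engineer"] * 2 + ["Other"] * 8)

# Quantifier estimates: the class is equally prevalent in both groups.
peq = proportional_equality(pred, groups, 0.4, 0.4, target="Engineer")
print(peq)
```

Here the predicted ratio is 3 while the estimated true ratio is 1, so PEq is about 2, far from any small tolerance ε; a classifier that labelled 4 members of each group as Engineer would instead give PEq equal to zero.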

<sup>2</sup> A well-known example comes from machine translation. In the past, it was reported that services such as Google Translate or Microsoft Translator, when translating into English from gender-neutral languages such as Turkish (where, e.g., the personal pronoun "o" is used for males and females alike), tended to associate words such as "doctor" to male pronouns ("O bir doktor" → "He is a doctor"), while they tended to associate words such as "cook" to female pronouns ("O bir ahci" → "She is a cook"), presumably due to gender bias present in the text corpora the translation service had been trained on. See (Emel Ince, *Do the footprints of stereotyping and gender bias follow us in online environments?*, 2018, https://www.capstan.be/do-the-footprintsof-stereotyping-and-gender-bias-follow-us-in-online-environments/, retrieved on Feb 28, 2020) for the full story.

## *2.2.2 Measuring Fairness*

Quantification methods are also suited to "measuring (classifier) fairness under unawareness", i.e., providing estimates of the fairness of classifiers with respect to a sensitive attribute (e.g., race, sex) in situations where the values of the sensitive attribute are not available at classifier training and/or test time. This is a common setting in practice, due to several factors, including legislation on demographic data collection (Bogen et al., 2020), privacy-by-design standards, and a data minimisation ethos (Andrus et al., 2021). For this reason, the problem of measuring fairness under unawareness has become important for many practitioners interested in evaluating the differential impact of their classifiers across salient subpopulations, identified by sensitive attributes whose ground truth values are not known (Holstein et al., 2019).

Fabris et al. (2021) adapt quantification approaches to tackle the fairness-under-unawareness problem. For the sake of exposition, let us focus on "demographic parity" (Barocas et al., 2019; Calders and Verwer, 2010), a measure of classifier fairness focused on the difference in the values of "acceptance rate" (i.e., the fraction of data items that are assigned the class of interest) across different subpopulations (determined by sensitive attribute *s*) for a classifier *k* : *X* → *Y*, issuing predictions *k(***x***)* for a target variable (e.g., employability) across the data points (e.g., candidates).<sup>3</sup> Let us consider again, for the sake of exposition, binary sex as the sensitive attribute *s*, with class values *c* ∈ *S* = {♀, ♂}, and employability as the target variable, with class values *y* ∈ *Y* = {⊕, ⊖} (where we assume that ⊕ stands for "Hire" and ⊖ stands for "Turn down"). The demographic disparity (DD) of classifier *k* with respect to sensitive attribute *s* is defined as

$$\text{DD}(k, s, \sigma) = p_{\sigma}^{k}(\hat{\oplus}|\text{♀}) - p_{\sigma}^{k}(\hat{\oplus}|\text{♂}) \tag{2.4}$$

where

$$p_{\sigma}^{k}(\hat{\oplus}|c) = p_{\sigma}^{k}(c|\hat{\oplus}) \frac{p_{\sigma}^{k}(\hat{\oplus})}{p_{\sigma}(c)}\tag{2.5}$$

is the acceptance rate for class *c*, and where Equation 2.5 is just an application of Bayes' theorem. Under this measure, classifiers are considered fair if their DD is close to zero, while extreme values of −1 or +1 indicate maximum unfairness, since the difference in acceptance rates across sensitive subpopulations is maximum. If *k(***x***)* represents the employability of candidates, *k* as applied to *σ* is considered maximally fair under DD if the probability *p*<sup>*k*</sup><sub>*σ*</sub>(⊕̂) of being hired is the same for males and females, which would mean that DD(*k, s, σ*) = 0. Due to the difficulties in demographic data procurement outlined above, the values for sensitive attribute *s* are often unknown at classifier training and/or test time. The value of *p*<sup>*k*</sup><sub>*σ*</sub>(⊕̂|*c*) can be computed if we have reliable estimates of groupwise prevalence values *p*<sup>*k*</sup><sub>*σ*</sub>(*c*|⊕̂) and *p*<sup>*k*</sup><sub>*σ*</sub>(*c*|⊖̂), since Equation 2.5 can be re-written as

$$p_{\sigma}^{k}(\hat{\oplus}|c) = p_{\sigma}^{k}(c|\hat{\oplus}) \frac{p_{\sigma}^{k}(\hat{\oplus})}{p_{\sigma}^{k}(\hat{\oplus}) \cdot p_{\sigma}^{k}(c|\hat{\oplus}) + p_{\sigma}^{k}(\hat{\ominus}) \cdot p_{\sigma}^{k}(c|\hat{\ominus})} \tag{2.6}$$

In other words, since *p*<sup>*k*</sup><sub>*σ*</sub>(⊕̂) and *p*<sup>*k*</sup><sub>*σ*</sub>(⊖̂) are available, DD(*k, s, σ*) (Equation 2.4) can be readily estimated by leveraging quantification methods that provide estimates of *p*<sup>*k*</sup><sub>*σ*</sub>(*c*|⊕̂) and *p*<sup>*k*</sup><sub>*σ*</sub>(*c*|⊖̂). A necessary requirement for this is the availability of a (possibly small) auxiliary annotated dataset *L* in which the values of the sensitive attribute are the labels. This dataset is to be used for training the quantifier that must be applied to *σ*, and may derive from voluntary data disclosures, surveys, or other targeted efforts.

<sup>3</sup> In this section, we let *k(***x***)*, instead of *h(***x***)* as defined in Table 1.1, denote the hard classifier issuing predictions in *Y*, since here sensitive attributes in *S* are the target of quantification. In other words, we reserve the notation *h(***x***)* for a hard classifier issuing predictions in the same domain as the quantification task.

However, because of their nature, these datasets are likely to suffer from selection bias, and unlikely to be fully representative of the deployment conditions. Fabris et al. (2021) show that quantification methods are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to the inevitable distribution shift that derives e.g., from selection bias. Moreover, the authors show that quantification methods can effectively decouple the (desirable) objective of measuring classifier fairness from the (undesirable) side effect of allowing the inference of the sensitive attribute values of individuals, thus reducing the potential for model misuse at the individual level (e.g., profiling).
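The computation behind Equations 2.4 to 2.6 can be sketched as follows for a binary sensitive attribute. The inputs are the fraction of items accepted by the classifier, which is observable, and the prevalence of the female group among accepted and among rejected items, which are assumed here to be the outputs of a quantifier. The function names, the toy numbers, and the sign convention (female rate minus male rate) are assumptions made for illustration.

```python
def acceptance_rate(p_c_given_acc, p_c_given_rej, p_acc):
    """Equation 2.6: probability of acceptance for group c, computed from
    the prevalence of c among accepted items, the prevalence of c among
    rejected items, and the overall fraction of accepted items."""
    p_rej = 1.0 - p_acc
    p_c = p_acc * p_c_given_acc + p_rej * p_c_given_rej
    return p_c_given_acc * p_acc / p_c


def demographic_disparity(p_f_given_acc, p_f_given_rej, p_acc):
    """Equation 2.4 for a binary attribute: acceptance rate of the female
    group minus acceptance rate of the male group (assumed sign)."""
    rate_f = acceptance_rate(p_f_given_acc, p_f_given_rej, p_acc)
    # With two groups, male prevalence is the complement of female prevalence.
    rate_m = acceptance_rate(1 - p_f_given_acc, 1 - p_f_given_rej, p_acc)
    return rate_f - rate_m


# The classifier accepts 40% of the items (observable from its output).
p_acc = 0.4
# Assumed quantifier estimates: females are 25% of the accepted items
# and 50% of the rejected ones.
dd = demographic_disparity(0.25, 0.5, p_acc)
print(dd)
```

With these numbers the acceptance rate is 0.25 for females and 0.5 for males, giving a disparity of about −0.25; if the female group had the same prevalence among accepted and rejected items, the disparity would be zero. Note that only group-level prevalence estimates enter the computation, with no prediction of the sensitive attribute of any individual.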

## **2.3 Sentiment Analysis**

Sentiment classification, the task of classifying a piece of text about a certain object as expressing a positive, neutral, or negative sentiment toward that object, has become a ubiquitous enabling technology, with applications in many fields, including financial news analysis, brand positioning and reputation management, stock market prediction, customer relationships management, and others.

Of interest to us is the fact that, while in some applications the sentiment of a specific individual is of interest, in other cases the application requirements only involve assessing the sentiment of a certain population (Esuli and Sebastiani, 2010b); for instance, while customer relationship management is typically an application of the former type, brand positioning is usually concerned with collective sentiment only. In particular, Gao and Sebastiani (2016) observe that most endeavours having to do with sentiment classification in Twitter are really about *sentiment quantification*, since hardly anybody who sets out to classify tweets by sentiment is interested in the sentiment expressed in specific, individual tweets.

Collective sentiment is often an object of study in the social and political sciences, as well as in market research; much of what is discussed in the next two sections, which are devoted to applications in these disciplines, touches on sentiment-related issues too.

## **2.4 Social and Political Sciences**

The social and political sciences are disciplines in which individual cases hardly matter, and where the interest is instead on phenomena that require analysis at the aggregate level.

One of the many examples of this (and of the rising field of *computational social science*) is illustrated in Figure 2.1 (from Dodds et al., 2011). Here, the authors set out to study the temporal patterns of happiness in the population of Twitter users. Essentially, the authors engage in some type of sentiment classification (Happy vs. Unhappy) that detects whether a certain tweet denotes happiness or unhappiness, bin the results according to the time and date the corresponding tweets were issued, and plot the Happy and Unhappy relative frequencies of the corresponding bins on a temporal scale. This endeavour has two characteristics that are of interest to us. The first is that the objects of interest (the tweets) are unlabelled, i.e., it is not known (and it is not possible to determine with certainty) whether they are representative of the class Happy or not. The second is that the authors are not interested in individual tweets, but in the big picture, i.e., in the proportions (at different time points) of tweets that belong or do not belong to class Happy. This is thus a case where quantification (actually: sentiment quantification) techniques could have been applied. Yet a further example is reported in Figure 2.2 (from Borge-Holthoefer et al., 2015), where the authors mine the Twittersphere in order to determine (among other things) the prevalence values of the pro-military-intervention stance vs. the against-military-intervention stance concerning the summer 2013 "Arab spring" in Egypt. Again, we have a combination of unlabelled data and interest at the aggregate level only, which would have made this research suitable for the application of *stance quantification* (see Walker et al., 2012).

**Fig. 2.1** Temporal patterns of happiness as resulting from a Twitter study (from Dodds et al., 2011).

**Fig. 2.2** Temporal trend in the proportions of tweets supporting or opposing military intervention in Egypt during the "Arab spring" in summer 2013 (from Borge-Holthoefer et al., 2015).

Concerning the fact that social scientists are interested in phenomena that require analysis at the aggregate level, we simply echo the words of Hopkins and King (2010), who were the first to use a non-trivial quantification method (i.e., one different from "classify and count") for the purposes of political analysis:

When social scientists use formal content analysis, it is typically to make generalisations using document category proportions. (...)

Policy-makers or computer scientists may be interested in finding the needle in the haystack (...) but social scientists are more commonly interested in characterising the haystack. (...) Although computer scientists have methods for automated content analysis, most are optimised to classify individual documents, whereas social scientists instead want generalisations about the population of documents, such as the proportion in a given category.

In their work, these authors use the ReadMe algorithm (that we will analyse in detail in Section 4.4.1) with the aim of estimating the prevalence of different political candidates in bloggers' preferences through an analysis of their blog posts. In other works in a similar spirit, researchers have variously tried to estimate the distribution of press releases related to legislators' credit claiming efforts (Grimmer et al., 2012), to estimate the prevalence of different types of censored news in Chinese media (King et al., 2013), and to estimate the distribution of citizens' political preferences by performing sentiment analysis on tweets (Ceron et al., 2014).

It has to be noted that, while the above-mentioned works use a quantification method other than Classify and Count, the vast majority of works in the social and political sciences that make use of supervised learning use Classify and Count (see Mandel et al., 2012 for an example), no doubt due to a lack of awareness of the sub-optimality of this strategy, of the existence of better alternatives, and of the very existence of quantification as a task. This state of affairs is not limited to the social and political sciences, though, and cuts across all the disciplines to which quantification has been applied (or could be applied), and which will be mentioned in the next sections.

## **2.5 Market Research**

The goal of market research is to obtain information concerning the desires and needs of actual or potential customers of products or services. This information is usually collected through surveys, conducted by a survey specialist and involving a number of respondents. Conducting a survey usually involves a questionnaire, i.e., a list of questions which respondents are asked to answer. The majority of questions to be found in questionnaires are of the "closed" type, where the respondent is required to tick one of a predefined set of answers. Open (a.k.a. "open-ended") questions instead involve returning a textual answer. When computing the results of the survey, in order to manage open questions the survey specialist first defines a set of classes of interest for the given application (e.g., HatesSitComs, WantsMoreSoaps, etc., for a survey run on behalf of a TV network), and then classifies (either manually or via a machine-learned classifier) each answer based on its textual content. The results of the survey are then obtained by checking how many respondents' answers have been attributed to each class. Quite obviously, the focus on "how many" (as opposed to "which") in the previous sentence indicates that the survey specialist needs, rather than a classifier, a quantifier of open-ended answers.

The use of non-trivial methods for performing quantification for open-ended answers in market research has been proposed only very recently (Sebastiani, 2018). Unsurprisingly, previous literature (see Esuli and Sebastiani, 2010a for an example) just reports uses of Classify and Count, for pretty much the same reasons as described in Section 2.4.

## **2.6 Epidemiology**

Epidemiology is a discipline with traits analogous to the social and political sciences, since the objects of study are individuals but the quantities of interest are only indicators at the aggregate level. Epidemiologists try to obtain estimates of disease prevalence values across different geographical regions, time periods, age groups, or gender. (See the example in Figure 2.3.) These prevalence estimates are important in assessing the spread of infectious diseases, in assessing the impact of toxic environmental conditions, in planning and allocating health services, and in measuring health risks.

One way quantification may be applied to epidemiology is in establishing disease prevalence by analysing, via text quantification techniques, clinical reports of a textual nature. One such example is reported in Baccianella et al. (2013), where the authors quantify (using Classify and Count) over a dataset of radiology reports. Applications such as this are difficult, especially when it comes to rare diseases. In these cases, as already mentioned in Bullet 2 of Section 1.5, when creating the training set the classes that represent rare illnesses need to be oversampled, in order to improve the accuracy of the predictor; the distribution of these classes in the training data may thus be *very* different from the distribution in the unlabelled data, thus generating situations of extreme distribution shift.

**Fig. 2.3** The prevalence of tuberculosis in 2019, expressed as number of cases per 100,000 inhabitants (from the *Global Tuberculosis Report*, Geneva: World Health Organization, 2020).

Another (perhaps more peculiar) application of quantification to epidemiology that has been reported is the estimation of the prevalence of various causes of death via "verbal autopsies" (King and Lu, 2008). A verbal autopsy is a textual description of the symptoms that a deceased person exhibited before dying; this description may be obtained from family members or other caretakers. Such a description may be used in order to later establish the causes of death of the deceased in situations in which a doctor entitled to certify these causes is not available; examples of such scenarios are remote villages in developing countries, or areas far away from hospitals. Using a verbal autopsy in such a way can be framed as a supervised classification problem, using classification schemes where all known causes of death are organised in a taxonomy: a text classifier classifies the verbal description of symptoms (which can be represented as a vector of features) and assigns to it the classes in the taxonomy that befit this description. In order to train such a classifier, training data may be obtained at hospitals, since for patients who have died in a hospital both a description of symptoms obtained from nurses and doctors *and* the causes of death as certified by a doctor are typically available. When these causes of death are needed for reasons other than epidemiological ones, a classifier is the desirable tool; for the needs of epidemiology, instead, a quantifier is the most adequate one. Note that in typical scenarios of the above type, distribution shift is at play, for various reasons. One reason is the same as mentioned for the analysis of clinical reports, i.e., rare diseases requiring oversampling for creating the training data. A second reason is that the training set may have been collected by merging different datasets from hospitals in different geographical areas. Yet another reason is that local environmental conditions in the application scenario (e.g., a nearby toxic industrial plant) may make these conditions irreproducible at training time. One application of quantification to establishing the prevalence of causes of death for epidemiological purposes is reported in King and Lu (2008) and King et al. (2010), where the ReadMe quantification method (to be discussed in Section 4.4.1) is used.

A further type of application involves analysing social media posts in order to obtain indicators and trends related to public health. As reported by Daughton and Paul (2019), "classifying and counting" posts that allow one to infer specific health-related characteristics of the person who has posted them has been widely used, for applications ranging from influenza surveillance to measuring attitudes towards vaccination. However, only Daughton and Paul (2019) themselves, in yet another influenza surveillance application, have approached these problems by using quantification methods other than the trivial Classify and Count.

## **2.7 Ecological Modelling**

When attempting to characterise ecosystems in order to allow their management and preservation, ecologists often need to assess the distribution of certain species across land and sea. When individual living beings cannot be characterised with certainty as belonging to a certain species or not, classification (carried out either manually or via a trained classifier) needs to be employed. However, ecologists are often interested in characterising not individual living beings, but entire populations of them; this is where quantification comes into play. A work in this direction is the one by González et al. (2017), which applies quantification technology in order to estimate the distribution of various plankton species in images of sea water samples.

A similar situation arises with land cover (LC) mapping. Given an aerial (e.g., satellite) image, LC mapping has to do with characterising how much of the territory represented in the image is covered in, e.g., forest, cultivated land, water, urban areas, etc. In order to do so, each pixel of the image is classified by an automated classifier as belonging to one of the above LC types. However, since we are only interested in predicting *how much* of the territory represented in the image is covered in a certain LC type, we may replace a classifier with a quantifier. This is the approach taken by Latinne et al. (2001), who (using the method that we will describe in Section 4.2.9) perform LC mapping on Landsat satellite images.

A work in the same spirit is Beijbom et al. (2015), where quantification is used to monitor the world's coral reefs by performing quantification on underwater images of seabed cover. Here, the objective is to estimate the percentages of seabed covered by each of 32 different species (see Figure 2.4), and distribution shift is caused by the fact that the location where the training images have been acquired is typically different, and thus exhibits a different distribution of species, from the locations where the unlabelled images are obtained (see Figure 1.2).

**Fig. 2.4** Class prevalence of each of 32 living species in seabed cover as estimated via quantification technology (from Beijbom et al., 2015); the different columns represent different samples on which quantification has been performed.

## **2.8 Resource Allocation**

Companies need to carefully plan how to allocate and distribute human resources to specific departments of the company, and must anticipate the needs of these departments in order not to be caught off-guard when the amount of work for a certain department spikes due to unusual circumstances.

In a series of papers, Forman (2005, 2006, 2008) describes an application of supervised prevalence estimation to resource allocation within a company. His work consists of automatically classifying the transcripts of phone calls received at the customer support department of a large IT company, where the classes are the different types of issues that such a customer support department is routinely asked to solve. Since the goal is to detect which issues are more prevalent, and thus need more personnel to be allocated to them, he proposes to use text quantification (instead of classification) technology. A correct estimation of the prevalence of the different issues not only allows a more adequate allocation of human resources: if performed systematically, it also makes it possible to identify increasingly prevalent issues before they get out of control, to monitor whether the resource allocation thus performed has been effective, and to focus product re-engineering / redesign efforts on the areas where this effort is most needed.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Evaluation of Quantification Algorithms**

Like all other supervised learning algorithms, quantification algorithms must be subjected to a thorough experimental evaluation, and a pillar of this evaluation is the mathematical measure to be used. Sections 3.1 to 3.3 thus review the main evaluation measures for quantification that have been proposed for the various problems discussed in Section 1.4.

As already hinted in Section 1.1, quantification may be seen as generating a predicted distribution $\hat{p}$ over *Y* that approximates a true distribution *p* over *Y*. Evaluating quantification thus means measuring how well $\hat{p}$ fits *p*. We will thus be concerned with discussing functions that attempt to measure this goodness-of-fit; we hereafter use the notation $D(p, \hat{p})$ to indicate such a function.

In this book we assume that the evaluation measures we are concerned with are measures of quantification error, and not of quantification accuracy. The reason for this is that most, if not all, of the evaluation measures for quantification that have been used so far are indeed measures of error, so it would be slightly unnatural to frame everything in terms of quantification accuracy. Since any measure of accuracy can be turned into a measure of error (typically: by taking its negation), this is an inessential factor anyway.

A further problem in evaluating quantification is how we should choose the dataset and the samples on which to carry out this evaluation, in order for them to be representative of the scenarios encountered in real-world applications. This is a particularly thorny issue, since available datasets might not exhibit the type of shift, or the amount of shift, that we might want our quantifiers to be robust to. Section 3.4 thus discusses the different experimental protocols that have been proposed in the literature in order to address this problem.

## **3.1 Measures for Evaluating SLQ, BQ, and MLQ**

In this section we will discuss a number of evaluation measures that have been proposed in the literature for evaluating single-label quantification.<sup>1</sup> As mentioned in Section 1.4, these measures can also be used in order to evaluate binary quantification and multi-label quantification, since BQ is a special case of SLQ, and since evaluating the error of a multi-label quantifier can be done by evaluating its BQ error for each *y* ∈ *Y*. Many of the measures that we discuss here were originally proposed for BQ, but can be easily extended to deal with SLQ in general; we present them in their SLQ form, even when they were originally proposed in BQ form.

Essentially all evaluation measures that have been proposed in the quantification literature are *divergences*. Formally, a divergence *D* is a measure of how a predicted distribution $\hat{p}$ "diverges" (i.e., differs) from the true distribution *p*, and is such that (1) $D(p, \hat{p}) = 0$ if and only if $p = \hat{p}$, and (2) $D(p, \hat{p}) > 0$ for all $\hat{p} \neq p$. As an aside, note that two distributions *p* and $\hat{p}$ over *Y* are essentially two nonnegative-valued, length-normalised vectors of dimensionality |*Y*|. The literature on the evaluation measures for quantification thus obviously intersects the literature on functions for computing the similarity of two vectors.

We here need to stress a key difference between measures of classification accuracy and measures of quantification accuracy (or error). The objects of classification are individual unlabelled items, and all measures of classification accuracy (e.g., *F*1) are defined with respect to a test *set* of such objects. The objects of quantification, instead, are samples, and all the measures of quantification error we will discuss in this book are defined on a *single* such sample (i.e., they measure how well the true distribution of the classes *across this individual sample* is approximated by the predicted distribution of the classes across the same sample). Since every evaluation is worthless if carried out on a single object, it is clear that quantification systems need to be evaluated on *sets* of samples. This means that every measure that we are going to discuss needs first to be evaluated on each sample, and then its global score across the test set (i.e., the set of samples on which testing is carried out) needs to be computed. This global score may be computed via any measure of central tendency, e.g., via an average, or a median, or other.
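The two-step procedure just described (score each sample, then aggregate) can be sketched as follows. This is a minimal illustration that uses Absolute Error (Equation 3.2) as the per-sample measure; the three samples and their prevalence values are invented for the example.

```python
from statistics import mean, median


def absolute_error(p_true, p_pred):
    """Per-sample Absolute Error (Equation 3.2) over a shared set of classes."""
    return sum(abs(p_pred[y] - p_true[y]) for y in p_true) / len(p_true)


# Each test sample is a pair (true distribution, predicted distribution).
# The values are made up for illustration.
samples = [
    ({"pos": 0.10, "neg": 0.90}, {"pos": 0.12, "neg": 0.88}),
    ({"pos": 0.50, "neg": 0.50}, {"pos": 0.45, "neg": 0.55}),
    ({"pos": 0.80, "neg": 0.20}, {"pos": 0.60, "neg": 0.40}),
]

# Step 1: evaluate the measure on every sample individually.
errors = [absolute_error(p, p_hat) for p, p_hat in samples]

# Step 2: summarise the per-sample scores with a measure of central tendency.
mean_error = mean(errors)
median_error = median(errors)
print(mean_error, median_error)
```

Here the mean and the median differ noticeably because one sample has a much larger error than the others, which is one reason why the choice of the aggregation function deserves to be reported alongside the results.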

<sup>1</sup> The measures discussed in this section are just the most frequently used ones, and are by no means an exhaustive list. E.g., other functions that have occasionally been used as evaluation measures for quantification are the *Pearson Divergence* (Ceron et al., 2016) and the *Discordance Ratio* (Levin and Roitman, 2017).

## *3.1.1 Properties of Evaluation Measures for SLQ, BQ, and MLQ*

The most thorough published study of evaluation measures for SLQ is probably that of Sebastiani (2020). This paper defines a number of interesting formal properties that an evaluation measure for SLQ may or may not enjoy, discusses if (and when) each of these properties is desirable, and analyses whether the evaluation measures that have been used in the quantification literature enjoy them or not; this process is typical of the so-called *axiomatic approach* to "evaluating evaluation", i.e., to the study of evaluation measures (Busin and Mizzaro, 2013), an approach that has also been applied to other tasks such as classification and clustering. A significant result of this paper is that no existing evaluation measure for SLQ satisfies all the properties identified as desirable; still, some evaluation measures are proven to be "less inadequate" than others. We here briefly discuss four main such properties, mostly by way of examples. Sebastiani (2020) discusses still other properties, but these are satisfied by all the evaluation measures for quantification proposed in the literature, and as such are less interesting.

The first property we discuss here is *Maximum* (**MAX**). Basically, an evaluation measure for SLQ that enjoys **MAX** is one whose values are upper-bounded by a value $\beta > 0$, which is the same for all *Y* and for all *p*, and which is such that $D(p, \hat{p}) = \beta$ for at least one predicted distribution $\hat{p}$, called the *perverse* (i.e., worst possible) *estimator*. An evaluation measure that enjoys **MAX** is such that its range (or better: its image) is independent of the problem setting, and this makes it easy to judge whether a given value of *D* means high or low quantification error; in other words, should this range depend on *Y*, or on its cardinality, or on the true distribution *p*, we would not be able to easily interpret the meaning of a given value of *D*. An additional, possibly even more important reason for requiring this range to be independent of the problem setting is that, in order to test a given quantification method, the measure needs (as noted above) to be evaluated on a set of *n* test samples $\sigma\_{1}, \ldots, \sigma\_{n}$ (each characterised by its own true distribution), and a measure of central tendency across the *n* resulting values then needs to be computed. If, for these *n* samples, the measure ranges on *n* different intervals, this measure of central tendency will return unreliable results, since the results obtained on the samples characterised by the wider such intervals will exert a higher influence on the resulting value.

The second property is *Impartiality* (**IMP**). In essence, an evaluation measure *D* that enjoys **IMP** equally penalises the underestimation of a true prevalence *p(y)* by an amount *a* (i.e., returning $\hat{p}(y) = p(y) - a$) or its overestimation by the same amount *a* (i.e., returning $\hat{p}(y) = p(y) + a$). This makes sense, because underestimation and overestimation should be considered equally undesirable, unless there is a specific reason (i.e., application need) for not doing so; in the latter case, the measure we choose should make its bias explicit, i.e., include a tunable parameter (similar in spirit to the $\beta$ parameter of $F\_{\beta}$) that allows specifying *how much* underestimation should be penalised more/less than overestimation.

The third property is *Relativity* (**REL**). In a nutshell, an evaluation measure that satisfies **REL** sanctions that an error of absolute magnitude *a* (i.e., the error made when $\hat{p}(y) = p(y) \pm a$) is more serious when the true class prevalence is smaller. In some applications of quantification **REL** is indeed desirable. Consider, as an example, the case in which the prevalence *p(y)* of a certain cause of death *y* in a population has to be estimated, as discussed in Section 2.6, from "verbal autopsies". In this case, the evaluation measure should arguably enjoy **REL**; in fact, predicting $\hat{p}(y) = 0.0101$ when $p(y) = 0.0001$ is a much more serious mistake than predicting $\hat{p}(y) = 0.1100$ when $p(y) = 0.1000$, since in the former case a very rare cause of death is overestimated by two orders of magnitude, while the same is not true in the latter case.

The fourth property is *Absoluteness* (**ABS**), and is the opposite of **REL**. Basically, an evaluation measure that satisfies **ABS** sanctions that an error of absolute magnitude *a* should be penalised independently of the value of the true class prevalence. Of course, an evaluation measure cannot enjoy **REL** and **ABS** at the same time; however, while there are applications that require **REL**, other applications require **ABS**. Consider an example in which we want to predict the prevalence of the NoShow class among the passengers booked on a flight with actual capacity *n* (so that the airline can "overbook" additional $\hat{p}(\text{NoShow}) \times n$ seats). Here the evaluation measure should enjoy **ABS**, since returning $\hat{p}(\text{NoShow}) = 0.05$ when $p(\text{NoShow}) = 0.10$ or returning $\hat{p}(\text{NoShow}) = 0.15$ when $p(\text{NoShow}) = 0.20$ brings about the same cost to the airline (i.e., that $0.05 \times n$ seats will remain empty).
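To make the contrast between **REL** and **ABS** concrete, the following sketch scores the two verbal-autopsy predictions mentioned above with Absolute Error (Equation 3.2) and Relative Absolute Error (Equation 3.6), both defined later in this chapter. It assumes a binary setting in which the second class simply takes the remaining probability mass; this completion is our assumption, not part of the original example.

```python
def absolute_error(p_true, p_pred):
    """Absolute Error (Equation 3.2)."""
    return sum(abs(q - p) for p, q in zip(p_true, p_pred)) / len(p_true)


def relative_absolute_error(p_true, p_pred):
    """Relative Absolute Error (Equation 3.6); needs nonzero true prevalences."""
    return sum(abs(q - p) / p for p, q in zip(p_true, p_pred)) / len(p_true)


# Rare cause of death: true prevalence 0.0001, predicted 0.0101.
rare_true, rare_pred = [0.0001, 0.9999], [0.0101, 0.9899]
# Common cause of death: true prevalence 0.1000, predicted 0.1100.
common_true, common_pred = [0.1000, 0.9000], [0.1100, 0.8900]

ae_rare = absolute_error(rare_true, rare_pred)
ae_common = absolute_error(common_true, common_pred)
rae_rare = relative_absolute_error(rare_true, rare_pred)
rae_common = relative_absolute_error(common_true, common_pred)

# AE scores the two mistakes identically, RAE does not.
print(ae_rare, ae_common)
print(rae_rare, rae_common)
```

Both predictions are off by the same absolute amount, so AE cannot tell them apart, whereas RAE penalises the error on the rare class far more heavily.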

Note that, while **REL** and **ABS** are mutually exclusive, they do not cover the entire space of possibilities, i.e., there can be measures that enjoy neither **REL** nor **ABS**. One such measure is *cosine distance*, which, as can be shown, considers an error of absolute magnitude *a less* serious when the true class prevalence is smaller.<sup>2</sup>

We will frame the discussion of evaluation measures for SLQ in terms of these four properties; for each such property and for each measure discussed in the following sections, Sebastiani (2020) presents proofs of whether the measure enjoys or does not enjoy the property.<sup>3</sup>

<sup>2</sup> Cosine distance will not be discussed any further in this book, because it has never been proposed or used as an evaluation measure for SLQ, and because a measure that enjoys neither **REL** nor **ABS** is arguably of little use in any application of quantification.

<sup>3</sup> Note that there are several other properties that the literature on divergence functions and distance functions discusses, and that we do not consider here because we do not deem them interesting when it comes to evaluating quantification. For instance, one of them is *symmetry*, i.e., the property that states that for any two distributions $p'$ and $p''$ it holds that $D(p', p'') = D(p'', p')$; in evaluating quantification we are not interested in symmetry, because our two distributions are not just any two distributions, but are always a true distribution and a predicted distribution, and switching their roles is not interesting.

## *3.1.2 Bias*

*Bias* (B), defined as

$$\mathbf{B}(\mathbf{y}) = \hat{p}(\mathbf{y}) - p(\mathbf{y}) \tag{3.1}$$

is technically not an evaluation measure for quantification as we have defined it before, since it does not apply to an entire distribution *p* but only to a specific label *y*. Even when using it in a binary setting, one thus needs to specify which of the two classes it is applied to. It is a fairly simplistic measure, and we cover it only since it has been used in several papers on quantification (e.g., Forman, 2005, 2006; Tang et al., 2010).

A positive B score indicates that the prevalence of *y* has been overestimated, while a negative score indicates that it has been underestimated. If used as an evaluation measure for quantification, an obvious problem with B is that averaging the scores across different classes brings about unintuitive results, since the positive bias for one class and the negative bias for another class cancel each other out. The same problem occurs when sticking to the same class but averaging across different samples.

As a result, this measure can at most be used to determine if a method has a tendency to underestimate or overestimate the prevalence of a specific class (typically: the minority class) in BQ, and not as an evaluation measure for general use.
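The cancellation problem can be seen in a small sketch. The two samples below are invented: the same class is overestimated by 0.10 in one and underestimated by 0.10 in the other.

```python
from statistics import mean


def bias(p_true_y, p_pred_y):
    """Bias for a single class y (Equation 3.1)."""
    return p_pred_y - p_true_y


# (true prevalence, predicted prevalence) of the same class in two samples.
pairs = [(0.20, 0.30), (0.40, 0.30)]

biases = [bias(t, p) for t, p in pairs]
mean_bias = mean(biases)                       # close to 0: errors cancel out
mean_abs_bias = mean([abs(b) for b in biases])  # close to 0.10: errors remain
print(mean_bias, mean_abs_bias)
```

The average bias is (numerically) zero even though neither estimate is accurate, while averaging the absolute values keeps the error visible.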

## *3.1.3 Absolute Error and its Variants*

*Absolute Error* (AE), defined as

$$\text{AE}(p,\hat{p}) = \frac{1}{|\mathcal{Y}|} \sum\_{\mathbf{y} \in \mathcal{Y}} |\hat{p}(\mathbf{y}) - p(\mathbf{y})| \tag{3.2}$$

is similar, but enforces the notion that positive and negative bias are equally undesirable. As a result, averaging it across several classes, or several samples, is not problematic.

As shown in Sebastiani (2020), AE enjoys **IMP** and **ABS** but does not enjoy **MAX** (and, since it enjoys **ABS**, does not enjoy **REL** either), since AE ranges between 0 (best) and

$$z\_{\rm AE} = \frac{2(1 - \min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y}))}{|\mathcal{Y}|} \tag{3.3}$$

(worst), i.e., its range depends on the true distribution *p* and on the cardinality of *Y*.
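The dependence of the upper bound of AE on the true distribution can be checked numerically. The sketch below computes the bound of Equation 3.3 for a few made-up true distributions, and verifies on one of them that the bound is reached by a predictor that puts all the probability mass on the least prevalent class.

```python
def absolute_error(p_true, p_pred):
    """Absolute Error (Equation 3.2)."""
    return sum(abs(q - p) for p, q in zip(p_true, p_pred)) / len(p_true)


def max_absolute_error(p_true):
    """Upper bound of AE for a given true distribution (Equation 3.3)."""
    return 2 * (1 - min(p_true)) / len(p_true)


z_balanced = max_absolute_error([0.5, 0.5])
z_skewed = max_absolute_error([0.01, 0.99])
z_three = max_absolute_error([0.2, 0.3, 0.5])
print(z_balanced, z_skewed, z_three)

# A worst-case prediction for the skewed sample: all mass on the rarest class.
ae_worst = absolute_error([0.01, 0.99], [1.0, 0.0])
print(ae_worst)
```

Since the three bounds differ, the same AE value can mean a nearly worst-case prediction on one sample and a moderate error on another.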

If viewed as a generic function of dissimilarity between vectors (and not just probability distributions), AE is nothing else than the well-known "city-block distance" normalised by the number of classes. Note that AE often goes by the name of *Mean* Absolute Error; for simplicity, for this and the other measures we discuss in the rest of this book we will omit the qualification "Mean", since every measure mediates across the class-specific values in its own way. Some recent papers (Beijbom et al., 2015; González et al., 2017) that tackle quantification in the context of ecological modelling discuss or use, as an evaluation measure for quantification, *Bray-Curtis dissimilarity* (BCD), a measure popular in ecology for measuring the dissimilarity of two samples. However, when used to measure the dissimilarity of two probability distributions, BCD defaults to AE; as a result we will not analyse BCD any further.

*Normalised Absolute Error* (NAE), defined as

$$\text{NAE}(p,\hat{p}) = \frac{\text{AE}(p,\hat{p})}{z\_{\rm AE}} = \frac{\sum\_{\mathbf{y} \in \mathcal{Y}} |\hat{p}(\mathbf{y}) - p(\mathbf{y})|}{2(1 - \min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y}))} \tag{3.4}$$

is a version of AE that always ranges between 0 (best) and 1 (worst), and thus enjoys **MAX**. However, NAE does not enjoy **ABS** (while at the same time not enjoying **REL** either).
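A minimal sketch of NAE, following the right-hand side of Equation 3.4, is given below; the three-class distribution is made up, and the worst-case predictor is the one that assigns all the mass to the least prevalent class.

```python
def normalised_absolute_error(p_true, p_pred):
    """Normalised Absolute Error (Equation 3.4); assumes at least two classes."""
    total = sum(abs(q - p) for p, q in zip(p_true, p_pred))
    return total / (2 * (1 - min(p_true)))


p_true = [0.2, 0.3, 0.5]

nae_perfect = normalised_absolute_error(p_true, [0.2, 0.3, 0.5])
nae_typical = normalised_absolute_error(p_true, [0.3, 0.3, 0.4])
nae_worst = normalised_absolute_error(p_true, [1.0, 0.0, 0.0])
print(nae_perfect, nae_typical, nae_worst)
```

The perfect prediction scores 0 and the worst-case prediction scores 1, regardless of the number of classes or of the true distribution.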

A slight variant of absolute error is *Squared Error* (SE), defined as

$$\text{SE}(p,\hat{p}) = \frac{1}{|\mathcal{Y}|} \sum\_{\mathbf{y} \in \mathcal{Y}} (\hat{p}(\mathbf{y}) - p(\mathbf{y}))^2 \tag{3.5}$$

It obviously shares the same pros and cons as AE, and we will not discuss it any further.

For AE and for all the other evaluation measures for quantification discussed in this book, Table 3.1 (reproduced from Sebastiani (2020)) lists the papers where the measure has been proposed and those which have subsequently used it for evaluation purposes.

## *3.1.4 Relative Absolute Error and its Variants*

*Relative Absolute Error* (RAE), defined as

$$\text{RAE}(p,\hat{p}) = \frac{1}{|\mathcal{Y}|} \sum\_{\mathbf{y} \in \mathcal{Y}} \frac{|\hat{p}(\mathbf{y}) - p(\mathbf{y})|}{p(\mathbf{y})} \tag{3.6}$$

is a refinement of AE that enforces **REL** by making AE relative to true class prevalence. RAE enjoys **IMP** and **REL** but does not enjoy **MAX** and (obviously) **ABS**. It does not enjoy **MAX** because it ranges between 0 (best) and

$$z\_{\text{RAE}} = \frac{|\mathcal{Y}| - 1 + \frac{1 - \min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y})}{\min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y})}}{|\mathcal{Y}|} \tag{3.7}$$

(worst), i.e., its range depends on the true distribution *p* and on the cardinality of *Y*.

**Table 3.1** Research works about quantification where the evaluation measures for quantification discussed in this book have been first proposed (-) and later used ().

*Normalised Relative Absolute Error* (NRAE), a version of RAE that ranges between 0 (best) and 1 (worst), can thus be obtained as

$$\text{NRAE}(p,\hat{p}) = \frac{\text{RAE}(p,\hat{p})}{z\_{\text{RAE}}} = \frac{\sum\_{\mathbf{y} \in \mathcal{Y}} \frac{|\hat{p}(\mathbf{y}) - p(\mathbf{y})|}{p(\mathbf{y})}}{|\mathcal{Y}| - 1 + \frac{1 - \min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y})}{\min\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y})}} \tag{3.8}$$

However, it can be shown that NRAE does not enjoy **REL** (and does not enjoy **ABS** either), so its name "Normalised Relative Absolute Error" is somehow a misnomer.

Note that both RAE and NRAE may be undefined due to the presence of zero denominators. To solve this problem, in computing RAE and NRAE we can smooth both $p(y)$ and $\hat{p}(y)$ via additive smoothing, i.e., we take

$$\underline{p}(\mathbf{y}) = \frac{\epsilon + p(\mathbf{y})}{\epsilon |\mathcal{Y}| + \sum\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y})} \tag{3.9}$$

where $\underline{p}(y)$ denotes the smoothed version of $p(y)$ and the denominator is just a normalising factor (same for the $\hat{p}(y)$'s); the quantity $\epsilon = \frac{1}{2|U|}$ is often used as a smoothing factor. The smoothed versions of $p(y)$ and $\hat{p}(y)$ are then used in place of their original non-smoothed versions in Equations 3.6 and 3.8; as a result, RAE and NRAE are always defined. The same method will also be used for all other measures that may run into the problem of zero denominators (see, e.g., Equation 3.10), and that we will encounter in the next sections.
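As a minimal sketch of how Equations 3.1 to 3.9 fit together, the following Python functions compute AE, NAE, RAE and NRAE on prevalence vectors, with optional additive smoothing. The function names, the use of NumPy, and the sample size of 1000 used to set the smoothing factor are assumptions made for illustration only.

```python
import numpy as np

def smooth(p, eps):
    """Additive smoothing of a prevalence vector (Equation 3.9)."""
    p = np.asarray(p, dtype=float)
    return (eps + p) / (eps * len(p) + p.sum())

def ae(p, p_hat):
    """Absolute Error: mean absolute difference across classes."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return float(np.abs(p_hat - p).mean())

def nae(p, p_hat):
    """Normalised Absolute Error (Equation 3.4)."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return float(np.abs(p_hat - p).sum() / (2 * (1 - p.min())))

def rae(p, p_hat, eps=None):
    """Relative Absolute Error (Equation 3.6); smoothed if eps is given."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    if eps is not None:
        p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return float((np.abs(p_hat - p) / p).mean())

def nrae(p, p_hat, eps=None):
    """Normalised RAE (Equation 3.8); smoothed if eps is given."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    if eps is not None:
        p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    # Denominator of Equation 3.8, i.e., |Y| times z_RAE
    z = len(p) - 1 + (1 - p.min()) / p.min()
    return float((np.abs(p_hat - p) / p).sum() / z)

# Hypothetical example: a class with true prevalence 0 makes RAE undefined
# unless smoothing is applied (here eps = 1 / (2 * 1000), for 1000 items).
p_true = [0.0, 0.5, 0.5]
p_est = [0.1, 0.4, 0.5]
print(ae(p_true, p_est), rae(p_true, p_est, eps=1 / 2000))
```

In this sketch the worst case of both normalised measures is reached when all the estimated mass is placed on the least prevalent class, which is a convenient sanity check for an implementation.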

## *3.1.5 Kullback-Leibler Divergence and its Variants*

Forman (2005) proposes to evaluate SLQ by means of *normalised cross-entropy*, better known as *Kullback-Leibler Divergence* (KLD). KLD, defined as

$$\text{KLD}(p,\hat{p}) = \sum\_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\hat{p}(\mathbf{y})} \tag{3.10}$$

ranges between 0 (best) and +∞ (worst). KLD is widely used as an evaluation measure for SLQ, and it has also been adopted as the official evaluation measure of the only quantification-related shared task that has been organised so far, Subtask D "Tweet Quantification on a 2-point Scale" of SemEval-2016 and SemEval-2017 "Task 4: Sentiment Analysis in Twitter" (Nakov et al., 2016, 2017).

The fact that KLD is not upper-bounded means that it does not satisfy **MAX**.<sup>4</sup> *Normalised Kullback-Leibler Divergence* (NKLD), defined as

$$\text{NKLD}(p,\hat{p}) = 2\frac{e^{\text{KLD}(p,\hat{p})}}{e^{\text{KLD}(p,\hat{p})} + 1} - 1\tag{3.11}$$

is a variant of KLD that does enjoy **MAX**, since it ranges between 0 (best) and 1 (worst). Unfortunately, as shown in Sebastiani (2020), both KLD and NKLD enjoy none of **IMP**, **REL** and **ABS**, which makes their use as evaluation measures for quantification questionable.

<sup>4</sup> Actually, the fact that smoothing is used makes KLD upper-bounded, but by a factor that depends on both *p* and *Y*, which means that KLD does not satisfy **MAX** anyway. See Sebastiani (2020, §4.7) for details.

A further problem of KLD and NKLD is that they score low in terms of understandability, i.e., they look esoteric to the mathematically uninitiated, at least when compared to the much easier-to-grasp AE and RAE. A second problem is that their typical values are usually difficult to make sense of, since genuinely engineered quantifiers may easily obtain values in $[10^{-6}, 10^{-2}]$.

A third, related problem is that realistic quantifiers trained by genuinely engineered quantification methods may obtain values that are different by orders of magnitude, which is something that experimenters may find difficult to interpret. As an example, assume a (very realistic) scenario in which $|\sigma| = 1000$, $\mathcal{Y} = \{y\_1, y\_2\}$, $p(y\_1) = 0.01$, and in which three different quantifiers $\hat{p}'$, $\hat{p}''$, $\hat{p}'''$ are such that $\hat{p}'(y\_1) = 0.0101$, $\hat{p}''(y\_1) = 0.0110$, $\hat{p}'''(y\_1) = 0.0200$. In this scenario KLD ranges on $[0, 7.46]$, $\text{KLD}(p, \hat{p}') = 4.78\text{e-}07$, $\text{KLD}(p, \hat{p}'') = 4.53\text{e-}05$, $\text{KLD}(p, \hat{p}''') = 3.02\text{e-}03$, i.e., the difference between $\text{KLD}(p, \hat{p}')$ and $\text{KLD}(p, \hat{p}'')$ and the difference between $\text{KLD}(p, \hat{p}'')$ and $\text{KLD}(p, \hat{p}''')$ are 2 orders of magnitude each, while the difference between $\text{KLD}(p, \hat{p}')$ and $\text{KLD}(p, \hat{p}''')$ is no less than 4 orders of magnitude. The increase in error (as computed by KLD) deriving from using $\hat{p}'''$ instead of $\hat{p}'$ is +632599%. We should add that, if (as noted at the beginning of Section 3.1) one wanted to average KLD results across a set of samples, the average would be completely dominated by the value with the highest order of magnitude, and the others would have little or no impact.
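The scenario above can be explored with a short Python sketch. It assumes natural logarithms and the additive smoothing of Equation 3.9 with a smoothing factor of 1/(2·1000); the function and variable names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def smooth(p, eps):
    """Additive smoothing of a prevalence vector (Equation 3.9)."""
    p = np.asarray(p, dtype=float)
    return (eps + p) / (eps * len(p) + p.sum())

def kld(p, p_hat, eps):
    """Kullback-Leibler Divergence (Equation 3.10) on smoothed vectors."""
    p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return float(np.sum(p * np.log(p / p_hat)))

def nkld(p, p_hat, eps):
    """Normalised KLD (Equation 3.11)."""
    k = kld(p, p_hat, eps)
    return float(2 * np.exp(k) / (np.exp(k) + 1) - 1)

eps = 1 / (2 * 1000)          # assumed smoothing factor for a sample of 1000 items
p_true = [0.01, 0.99]
estimates = {
    "first": [0.0101, 0.9899],
    "second": [0.0110, 0.9890],
    "third": [0.0200, 0.9800],
}

klds = {name: kld(p_true, q, eps) for name, q in estimates.items()}
upper_bound = kld(p_true, [1.0, 0.0], eps)   # all mass on the rarer class

for name, value in klds.items():
    print(f"{name}: KLD = {value:.2e}, NKLD = {nkld(p_true, estimates[name], eps):.2e}")
print(f"largest attainable KLD in this scenario: {upper_bound:.2f}")
```

Under these assumptions the three KLD values fall in three different orders of magnitude, which illustrates why averaging them across samples is dominated by the largest one.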

Unfortunately, switching from KLD to NKLD does not help much in this respect since, for realistic quantification systems, $\text{NKLD}(p, \hat{p}) \approx \frac{1}{2} \text{KLD}(p, \hat{p})$. The reason is that NKLD is obtained by applying a sigmoidal function (namely, the logistic function) to KLD, and the tangent to this sigmoid for $x = 0$ is $y = \frac{1}{2} x$; since the values of KLD for realistic quantifiers are (as we have observed above) very close to 0, for these values the $\text{NKLD}(p, \hat{p})$ curve is well approximated by $y = \frac{1}{2} \text{KLD}(p, \hat{p})$. As a measure for evaluating SLQ, NKLD thus *de facto* inherits most of the problems of KLD.
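The approximation can be made explicit with a short derivation, writing $x$ for $\text{KLD}(p, \hat{p})$ (a shorthand used here only for compactness):

$$\text{NKLD}(p,\hat{p}) = 2\frac{e^{x}}{e^{x} + 1} - 1 = \frac{e^{x} - 1}{e^{x} + 1} = \tanh\left(\frac{x}{2}\right) = \frac{x}{2} - \frac{x^{3}}{24} + O(x^{5})$$

For $x = 10^{-2}$ the cubic term is about $4 \times 10^{-8}$, against a linear term of $5 \times 10^{-3}$, so the linear term dominates throughout the range of values typical of realistic quantifiers.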

## *3.1.6 Which Measure is the Best for SLQ?*

Figure 3.1 (adapted from Sebastiani (2020)) plots the six main evaluation measures discussed in Sections 3.1.3 to 3.1.5 for the binary case. Table 3.2 summarises instead, in compact form, the properties that these measures enjoy. From this table it appears evident that no measure proposed so far is completely satisfactory. Which measure should one adopt then?

**Fig. 3.1** 2D plots and 3D plots (for a binary quantification task) for the six main evaluation measures mentioned in Sections 3.1.3 to 3.1.5; $p(y\_1)$ and $p(y\_2)$ are represented as $x$ and $(1 - x)$, respectively, while $\hat{p}(y\_1)$ and $\hat{p}(y\_2)$ are represented as $y$ and $(1 - y)$. Darker areas represent values closer to 0 (i.e., smaller error) while lighter areas represent values more distant from 0 (i.e., higher error).

KLD and NKLD are the least satisfactory ones, and seem out of the question. Concerning the others, the problem is that **MAX** seems to be incompatible with **REL** / **ABS**, and vice versa. In order to break the deadlock, it is important to remember that


Sebastiani (2020, §5.1) contends that Arguments 1 and 2 seem more important than Argument 3, since they are really about how an evaluation measure reflects the needs of the application; if the corresponding properties are not satisfied, one may argue that the quantification accuracy (or error) being measured is only loosely related to what the user really wants. Argument 3, while important, only implies that, if **MAX** is not satisfied, (1) results obtained on codeframes of different cardinality will not be comparable, and (2) results obtained on samples characterised by different true distributions will not be comparable. Despite this, results obtained by different systems on the same set of samples, even if this set contains samples that refer to codeframes of different cardinality, remain comparable.

This suggests that AE and RAE (or their "squared" versions, such as the SE measure of Section 3.1.3) are the measures of choice; AE should be preferred when an estimation error of a given absolute magnitude has the same impact independently of the true prevalence of the affected class, while RAE should be chosen when an estimation error of a given absolute magnitude should be considered more serious when the true prevalence of the affected class is lower.

## **3.2 Measures for Evaluating OQ**

## *3.2.1 Earth Mover's Distance*

The most popular measure for evaluating *ordinal quantification* systems is currently the *Earth Mover's Distance* (EMD – Rubner et al., 1998). EMD, also known as the *Vaseršteĭn metric* (Rüschendorf, 2001), is a function often used in content-based image retrieval for computing the distance between colour distributions of two images (see Levina and Bickel, 2001 for a rigorous probabilistic interpretation of the EMD). It was first proposed as an evaluation measure for ordinal quantification in Esuli and Sebastiani (2010b), and was used as the official evaluation measure of Subtask E "Tweet Quantification on a 5-point Scale" of SemEval-2016 and SemEval-2017 "Task 4: Sentiment Analysis in Twitter" (Nakov et al., 2016, 2017).

To see the intuition upon which the EMD is based, if the two distributions are interpreted as two different ways of scattering a certain amount of "earth" across different "heaps", the EMD is defined to be the minimum amount of work needed for transforming one set of heaps into the other, where the work is assumed to correspond to the sum of the amounts of earth moved times the distance by which they are moved. EMD may be seen as computing the minimal "cost" incurred in transforming one distribution into the other, where the cost is computed as the probability mass that needs to be shuffled around from one class to another, weighted by the "distance" between the classes involved.

Originally, the EMD is defined for the general case in which a distance $d(y', y'')$ is defined on $\mathcal{Y}^2$. In the much more specific case in which (a) there is a total order $y\_1 \prec \ldots \prec y\_{|\mathcal{Y}|}$ on the classes in $\mathcal{Y}$, and (b) $d(y\_i, y\_j) = d(y\_i, y\_{i+1}) + d(y\_{i+1}, y\_{i+2}) + \ldots + d(y\_{j-2}, y\_{j-1}) + d(y\_{j-1}, y\_j)$ for all $1 \leq i < j \leq |\mathcal{Y}|$ (as is the case in ordinal quantification), EMD comes down to the *Normalised Match Distance* (NMD) (Sakai, 2018; Werman et al., 1985), defined as

$$\text{NMD}(p,\hat{p}) = \frac{1}{|\mathcal{Y}| - 1} \sum\_{j=1}^{|\mathcal{Y}| - 1} d(\mathbf{y}\_j, \mathbf{y}\_{(j+1)}) \cdot |\sum\_{l=1}^{j} \hat{p}(\mathbf{y}\_l) - \sum\_{l=1}^{j} p(\mathbf{y}\_l)| \tag{3.12}$$

where $\frac{1}{|\mathcal{Y}| - 1}$ is just a normalisation factor that allows NMD to range between 0 (best) and 1 (worst).

The rationale of Equation 3.12 is the following. Assume that, in line with the interpretation of the NMD we have given above, in order to transform the estimated distribution $\hat{p}$ into the true distribution $p$ we need to move some estimated probability across classes, from the ones where prevalence has been overestimated to the ones where prevalence has been underestimated. We formalise this by saying that if $p(y\_i)$ has been overestimated there is going to be a positive quantity $(\hat{p}(y\_i) - p(y\_i))$ outgoing from class $y\_i$, while if $p(y\_i)$ has been underestimated this outgoing quantity is negative (which means that the incoming quantity is positive). In order to minimise the travelled distance, it makes sense to transfer probability mass between classes that are next to each other in the total order. The first step, from left to right, is thus to transfer $|\hat{p}(y\_1) - p(y\_1)|$ from $y\_1$ to $y\_2$ if $(\hat{p}(y\_1) - p(y\_1))$ is positive, or to transfer $|\hat{p}(y\_1) - p(y\_1)|$ from $y\_2$ to $y\_1$ if it is negative; in either case the cost of this transfer is $d(y\_1, y\_2) \cdot |\hat{p}(y\_1) - p(y\_1)|$. Since $\hat{p}(y\_1)$ has now been transformed into $p(y\_1)$, the next step is to transfer probability mass from $y\_2$ to $y\_3$. The probability mass outgoing from $y\_2$ is now $(\hat{p}(y\_1) + \hat{p}(y\_2)) - (p(y\_1) + p(y\_2))$, which is going to be positive if $y\_1$ and $y\_2$ have altogether been overestimated, and negative otherwise; in either case the cost of this transfer is $d(y\_2, y\_3) \cdot |(\hat{p}(y\_1) + \hat{p}(y\_2)) - (p(y\_1) + p(y\_2))|$. Proceeding in this fashion, $|\mathcal{Y}| - 1$ probability mass transfers are performed, which explains Equation 3.12; this also shows that, in the form of Equation 3.12, NMD can be computed in $|\mathcal{Y}| - 1$ steps from the estimated and true class prevalence values.

Note that in many practical cases it happens that $d(y\_i, y\_{i+1}) = 1$ for all $i \in \{1, \ldots, (|\mathcal{Y}| - 1)\}$, which means that Equation 3.12 simplifies even further.

NMD can be seen as the ordinal equivalent of absolute error; in fact, assuming that $d(y', y'')$ is the same for all $y', y'' \in \mathcal{Y}$ (in which case ordinal quantification defaults to standard single-label quantification), the probability mass that needs to be moved from one class to another, weighted by the distance between the two classes (though this weighting is inessential, as all inter-class distances are the same) in order to recover $p$ from $\hat{p}$, is exactly absolute error.
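Equation 3.12 translates almost directly into code, since the inner sums are just cumulative sums of the two prevalence vectors. The following is a minimal Python sketch; the function name and the optional `dists` argument (the distances between adjacent classes, defaulting to 1) are assumptions made for illustration.

```python
import numpy as np

def nmd(p, p_hat, dists=None):
    """Normalised Match Distance (Equation 3.12).

    p, p_hat: prevalence vectors over classes listed in their total order.
    dists: distances d(y_j, y_{j+1}) between adjacent classes (default: all 1).
    """
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    n = len(p)
    if dists is None:
        dists = np.ones(n - 1)
    # Mass that must cross the boundary between class j and class j+1
    crossing = np.cumsum(p_hat - p)[:-1]
    return float(np.sum(np.asarray(dists, float) * np.abs(crossing)) / (n - 1))

# Hypothetical three-class examples
print(nmd([1, 0, 0], [0, 0, 1]))   # all mass at the wrong extreme
print(nmd([0, 1, 0], [0, 0, 1]))   # all mass off by one class
```

With unit distances, placing all the mass at the opposite extreme of the codeframe yields the maximum value, while an error of one class yields a proportionally smaller value.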

## *3.2.2 Root Normalised Order-Aware Divergence*

Another proposed measure for evaluating the quality of OQ estimates is the *Root Normalised Order-aware Divergence* (RNOD), proposed by Sakai (2018) and defined as

$$\text{RNOD}(p,\hat{p}) = \left(\frac{\sum\_{\mathbf{y}\_{i}\in\mathcal{Y}^{\*}}\sum\_{\mathbf{y}\_{j}\in\mathcal{Y}}d(\mathbf{y}\_{j},\mathbf{y}\_{i})(p(\mathbf{y}\_{j})-\hat{p}(\mathbf{y}\_{j}))^{2}}{|\mathcal{Y}^{\*}|(|\mathcal{Y}|-1)}\right)^{\frac{1}{2}}\tag{3.13}$$

where $\mathcal{Y}^{\*} = \{y\_i \in \mathcal{Y} \mid p(y\_i) > 0\}$.

However, RNOD is a more controversial measure for OQ than NMD since, without making it explicit, it penalizes more heavily mistakes (i.e., "transfers" of probability mass from a class to another) closer to the extremes of the codeframe. For instance, given codeframe $\mathcal{Y} = \{y\_1, y\_2, y\_3, y\_4, y\_5\}$, assume that the true distribution is $p = (0.2, 0.2, 0.2, 0.2, 0.2)$, and assume two predicted distributions $\hat{p}' = (0.2, 0.2, 0.3, 0.1, 0.2)$ and $\hat{p}'' = (0.2, 0.2, 0.2, 0.3, 0.1)$. The two predicted distributions make essentially the same mistake, i.e., erroneously "transfer" a probability mass of 0.1 from a class $y\_i$ to a class $y\_{(i-1)}$, the difference being that in $\hat{p}'$ it is the case that $i = 4$ and in $\hat{p}''$ it is the case that $i = 5$. NMD penalizes them equally (since $\text{NMD}(p, \hat{p}') = \text{NMD}(p, \hat{p}'') = 0.1$). RNOD instead does not (since $\text{RNOD}(p, \hat{p}') \approx 0.080$ while $\text{RNOD}(p, \hat{p}'') \approx 0.092$), and the degree to which mistakes closer to the extremes of the codeframe are penalised more heavily than the ones close to the centre of the codeframe is not explicit in the formula.
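The example can be checked numerically with the sketch below, which assumes unit distances between adjacent classes, so that the distance between two classes is the difference of their positions in the codeframe. Function names are hypothetical. Note that, when Equation 3.12 is applied literally, both NMD values come out as 0.025, i.e., 0.1 before the division by $|\mathcal{Y}| - 1$; the two estimates are penalised equally either way.

```python
import numpy as np

def nmd(p, p_hat):
    """Normalised Match Distance (Equation 3.12) with unit distances."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    crossing = np.cumsum(p_hat - p)[:-1]
    return float(np.abs(crossing).sum() / (len(p) - 1))

def rnod(p, p_hat):
    """Root Normalised Order-aware Divergence (Equation 3.13), unit distances."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    n = len(p)
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])   # d[i, j] = |i - j|
    support = p > 0                           # classes with non-zero true prevalence
    sq_err = (p - p_hat) ** 2
    order_aware = (d[support] @ sq_err).sum() # sum over y_i in Y*, y_j in Y
    return float(np.sqrt(order_aware / (support.sum() * (n - 1))))

p_true = [0.2, 0.2, 0.2, 0.2, 0.2]
est_centre = [0.2, 0.2, 0.3, 0.1, 0.2]   # mass moved from y4 to y3
est_edge = [0.2, 0.2, 0.2, 0.3, 0.1]     # mass moved from y5 to y4

print(nmd(p_true, est_centre), nmd(p_true, est_edge))
print(rnod(p_true, est_centre), rnod(p_true, est_edge))
```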

Other OQ evaluation measures are proposed by Sakai (2021), such as *Root Symmetric Normalised Order-aware Divergence* (RSNOD) and *Root Normalised Average Distance-Weighted sum of squares* (RNADW), but we do not consider them here since they are variants of RNOD that share the characteristics of RNOD mentioned above.

## **3.3 Measures for Evaluating Regression Quantification**

The only work to date that investigates *regression quantification* (RQ) is Bella et al. (2014); this work (discussed in detail in Section 5.2) is thus also the only one that discusses how to perform evaluation for this task.

In a regression problem, every input object is assigned a real-valued score as output, differently from the classification case in which the output has a categorical form. For the regression quantification scenario, Bella et al. (2014) identify two possible quantification goals, aggregated indicator estimation and distribution estimation.

The case of estimating an aggregated indicator is defined as the one in which the interest is in estimating a statistic function $\mathbb{I}$ that summarises some property of the distribution of regression scores over the unlabelled set *U*. A typical example of indicator function in regression quantification is the average of the regression values of the elements, i.e.,

$$\begin{aligned} \mathbb{I}(U) &= \mu\_U \\ &= \frac{1}{|U|} \sum\_{l=1}^{|U|} \mathbf{y}\_l \end{aligned} \tag{3.14}$$

Bella et al. (2014) observe that the single numerical value produced by an aggregated indicator can be compared to the true value from test data using an error measure such as the Squared Error (see Section 3.1.3).

The optimal value of SE is always zero, while the upper bound of this measure depends on the range of regression values and distribution of data. Bella et al. (2014) propose a variant of SE, VSE, that normalises the SE value by the variance of the training set, i.e.,

$$\text{VSE}(p, \hat{p}, L) = \frac{\text{SE}(p, \hat{p})}{\text{Var}(L)} \tag{3.15}$$

where Var*(L)* is the variance of the true regression scores computed on elements in the training set. The motivation the authors give to support VSE is to make the results from experiments less dependent on the magnitude range of the data when such experiments are run on different datasets, or involve repeated runs.

Bella et al. (2014) do not provide a theoretical motivation to support VSE, in particular on why the variance should be computed on the training set instead of the test set. In experiments using different test sets, either from different datasets or produced by sampling, the variance of the test set has the same ability to measure the difference in the magnitude range of values. In experiments using the same test set, a training set with higher variance gets an advantage when evaluated using VSE. The intuition the authors followed may be that learning from a training set with higher variance is more difficult than learning from a low-variance one. Yet, this does not take into account the actual variance of the test set. It could be the case that the test set also has a high variance, and the lower-variance training set is thus the one from which it is more difficult to learn an accurate regressor. Moreover, it is possible to cheat VSE by adding dummy examples to the training set with the sole aim of increasing its variance.
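The last point is easy to see in a small Python sketch. It assumes that the aggregated indicator is the mean of Equation 3.14, that SE is the squared difference between the true and the estimated indicator, and that the variance is the population variance; all names and numbers are hypothetical.

```python
import numpy as np

def vse(true_indicator, estimated_indicator, train_scores):
    """Variance-normalised Squared Error (Equation 3.15) for a scalar indicator."""
    se = (estimated_indicator - true_indicator) ** 2
    return float(se / np.var(train_scores))   # population variance (assumption)

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Same regressor output, but two dummy outliers are added to the training set
padded_train = np.append(train, [-100.0, 100.0])

honest = vse(3.0, 3.5, train)
inflated = vse(3.0, 3.5, padded_train)
print(honest, inflated)
```

The estimation error is identical in the two calls; only the denominator changes, and the reported VSE drops accordingly.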

The case of estimating a probability distribution over the regression values can be evaluated by comparing the distribution of the true values with that of the predicted values. This is an intermediate step, in terms of prediction complexity, between accurately predicting each single value (i.e., the actual regression problem) and predicting an aggregated estimator.

A way to compare probability distributions is to use divergence measures, yet Bella et al. (2014) observe that some divergence measures are not always defined when comparing empirical distributions,<sup>5</sup> and they thus suggest performing the evaluation using the cumulative distributions. Within the set of measures that compare cumulative distributions they mention the Kolmogorov-Smirnov statistic, but they criticise the fact that it only considers the point where the distributions differ the most, when the entire shape of the distributions should be considered instead. For this reason they suggest adopting, as a more refined evaluation measure, the *Cramér-von-Mises statistic* (Anderson, 1962), which integrates the difference between the two cumulative distributions. More specifically, they adopt in their experiments the *L1*-version of the statistic (Xiao et al., 2006).
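The difference between the two kinds of statistic can be illustrated with a rough Python sketch. It assumes that the *L1*-version amounts to integrating the absolute difference between the two empirical cumulative distributions; the exact definition used by Xiao et al. (2006) may differ in its details, and the function name and data are hypothetical.

```python
import numpy as np

def cdf_statistics(true_vals, pred_vals):
    """Return (max gap, integrated absolute gap) between two empirical CDFs."""
    true_sorted = np.sort(np.asarray(true_vals, float))
    pred_sorted = np.sort(np.asarray(pred_vals, float))
    grid = np.sort(np.concatenate([true_sorted, pred_sorted]))
    # Empirical CDFs evaluated at every observed value
    cdf_true = np.searchsorted(true_sorted, grid, side="right") / len(true_sorted)
    cdf_pred = np.searchsorted(pred_sorted, grid, side="right") / len(pred_sorted)
    gap = np.abs(cdf_true - cdf_pred)
    ks = float(gap.max())
    # Both CDFs are step functions, constant between consecutive grid points
    l1 = float(np.sum(gap[:-1] * np.diff(grid)))
    return ks, l1

true_scores = [0.0, 1.0, 2.0, 3.0]
print(cdf_statistics(true_scores, [0.0, 1.0, 2.0, 4.0]))    # one value slightly off
print(cdf_statistics(true_scores, [0.0, 1.0, 2.0, 10.0]))   # one value far off
```

In this toy example the maximum gap is the same for the two predictions, while the integrated gap grows with the size of the error, which is the behaviour that motivates preferring an integral-based statistic.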

## **3.4 Experimental Protocols for Evaluating Quantification**

Any test set routinely used for testing the accuracy of classification can obviously be used also for evaluating quantification. However, the problem is that, while for classification a set of *k* unlabelled items provides *k* test data points, for quantification the same test set just provides 1 test data point. Evaluating quantification algorithms is thus challenging, due to the fact that the availability of labelled data for testing purposes is not unlimited.

<sup>5</sup> E.g., Kullback-Leibler divergence requires that $p(x) = 0 \Rightarrow \hat{p}(x) = 0$. KL is thus undefined when the regressor predicts even a single value that is not within the set of values appearing in the test set.

There are two main experimental protocols that have been adopted in order to deal with this problem; we will here call them the *Natural-Prevalence Protocol* (NPP) and the *Artificial-Prevalence Protocol* (APP).

## *3.4.1 Natural Prevalence Protocol (NPP)*

The NPP was first used by Esuli and Sebastiani (2015). It consists of taking a large enough test set, partitioning it into a number of samples, and carrying out the evaluation individually on each such sample. For instance, Esuli and Sebastiani (2015) tested binary quantifiers on the well-known RCV1-v2 text classification dataset, whose test set consists of about 780,000 news items issued by the Reuters news agency over a period of 52 weeks, and labelled with 99 different classes. This allowed the authors to split the test set into 52 samples (each corresponding to a week), each of which provided 1 testing data point for 99 different BQ experiments, thus generating 52 × 99 = 5148 testing data points.

## *3.4.2 Artificial Prevalence Protocol (APP)*

The APP was first used by Saerens et al. (2002). This protocol consists of taking a standard dataset, split into a training set *L* and a set *U* of unlabelled items, and conducting repeated experiments in which either the training set prevalence or the test set prevalence of a class is artificially varied via subsampling. For instance, in the BQ experiments carried out by Forman (2005), given codeframe $\mathcal{Y} = \{\oplus, \ominus\}$, repeated experiments are conducted in which either examples of $\oplus$ or examples of $\ominus$ are removed at random from the test set in order to generate a predetermined prevalence of $\oplus$ in the sample *U* thus obtained. In this way, different samples can be generated, each characterised by a different prevalence of $\oplus$ (e.g., $p\_U(\oplus) \in \{0.00, 0.05, \ldots, 0.95, 1.00\}$). This can be repeated, thus generating multiple random samples for each class prevalence. Analogously, random removal of either positive or negative examples can be performed on the training set, thus bringing about training sets with different values of $p(\oplus)$. Example results of the application of the APP will be illustrated in Section 6.3.
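A minimal sketch of grid-based sample generation for the binary case is given below. It assumes that samples have a fixed size and are drawn without replacement, which is one possible way of realising the random removal described above; the function name, the synthetic labels, and the sample size of 100 are hypothetical.

```python
import numpy as np

def app_sample(labels, prevalence, size, rng):
    """Indices of a sample of `size` items with the given positive prevalence."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = int(round(prevalence * size))
    # Shuffle each class and keep as many items as the target prevalence requires
    chosen_pos = rng.permutation(pos)[:n_pos]
    chosen_neg = rng.permutation(neg)[:size - n_pos]
    return np.concatenate([chosen_pos, chosen_neg])

rng = np.random.default_rng(0)
test_labels = rng.integers(0, 2, size=2000)   # synthetic labelled test set
grid = np.linspace(0.0, 1.0, 21)              # 0.00, 0.05, ..., 1.00
repeats = 3

samples = [(prev, app_sample(test_labels, prev, 100, rng))
           for prev in grid for _ in range(repeats)]
print(len(samples), "samples generated")
```

Each element of `samples` is one test data point for a quantifier: the quantifier is run on the selected items and its estimate is compared with the prevalence that was imposed.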

Doing an analogous grid-based exploration in the SLQ setting is certainly possible, but cumbersome; for instance, if we want to restrict ourselves to class prevalence values in the set $\{0.00, 0.05, \ldots, 0.95, 1.00\}$, there are just 21 possible distributions in the BQ case, but in the SLQ case there are many, many more, especially when $|\mathcal{Y}|$ is high, due to combinatorial explosion. If we use a grid of class prevalence values $g = \{\frac{i}{m}\}\_{i=0}^{m}$ containing $|g| = m + 1$ possible values (with $m$ an integer), and we have $|\mathcal{Y}| = 2$ classes, then there are $(m + 1)$ choices for $p\_\sigma(y\_1)$; of course, these constrain the value of $p\_\sigma(y\_2)$, which must be equal to $1 - p\_\sigma(y\_1)$.

Let us define a function $K(m, n)$ that computes the number of possible combinations for $|\mathcal{Y}| = n$ classes using a grid of prevalence values from 0 to 1 at a step size of $\frac{1}{m}$. For the binary case discussed above, it is the case that $K(m, 2) = m + 1$. For the ternary case, i.e., when $n = 3$, we have $K(m, 3) = (m + 1)(m + 2)/2$. This follows from the observation that, when we set $p\_\sigma(y\_1) = 0/m$, there are $m + 1$ possible choices for $p\_\sigma(y\_2)$ (while $p\_\sigma(y\_3) = 1 - (p\_\sigma(y\_1) + p\_\sigma(y\_2))$ is constrained); when we set $p\_\sigma(y\_1) = 1/m$, there only exist $m$ possible choices for $p\_\sigma(y\_2)$; and so on, until we end up setting $p\_\sigma(y\_1) = m/m = 1$, for which there is only one possible combination $p\_\sigma(y\_2) = p\_\sigma(y\_3) = 0$ representing a valid distribution. In our previous example, with $|g| = 21$, we thus have $K(20, 3) = 231$. In general, for arbitrary $m$ and $n$ values, the number of possible prevalence distributions can be derived from the so-called "stars and bars" method<sup>6</sup> and is given by

$$K(m,n) = \binom{m+n-1}{n-1} \tag{3.16}$$
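Equation 3.16 can be checked against a brute-force enumeration of the grid. The sketch below uses `math.comb` from the Python standard library; the two function names are hypothetical.

```python
from math import comb

def num_grid_distributions(m, n):
    """K(m, n) of Equation 3.16: distributions over n classes on a grid of step 1/m."""
    return comb(m + n - 1, n - 1)

def enumerate_grid_distributions(m, n):
    """Yield all tuples of n non-negative integers summing to m.

    Dividing each entry by m gives a valid prevalence vector on the grid.
    """
    if n == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in enumerate_grid_distributions(m - first, n - 1):
            yield (first,) + rest

# Closed form versus explicit enumeration for the ternary case with m = 20
closed_form = num_grid_distributions(20, 3)
enumerated = sum(1 for _ in enumerate_grid_distributions(20, 3))
print(closed_form, enumerated)
```

The enumeration is only practical for small values of $m$ and $n$, while the closed form can be evaluated for any grid.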

## *3.4.3 A Variant of the APP Based on the Kraemer Algorithm*

As one would expect by looking at Equation 3.16, the number of possible distribution vectors that the APP generates grows very rapidly. To exemplify, for 5 classes we already reach $K(20, 5) = 10{,}626$ valid combinations, while for 10 classes the number of combinations rises to $K(20, 10) = 10{,}015{,}005$. Things get even worse when using a finer-grained grid; for example, using a step size of 0.01 (i.e., setting $m = 100$) the number of combinations to explore for 10 classes, $K(100, 10) > 4 \times 10^{12}$, becomes impractical.

<sup>6</sup> A probability distribution of $n$ classes taking prevalence values from a grid $g$ of $(m + 1)$ prevalence values of probability mass $1/m$ each, can be seen as a vector of $(m + n - 1)$ positions filled with $m$ "stars" (each star representing a probability mass of $1/m$) and $(n - 1)$ "bars" (each bar representing a separator for two adjacent classes). For example, for $n = 4$ and $m = 10$, the probability distribution given by $p\_\sigma(y\_0) = 0.3$, $p\_\sigma(y\_1) = 0$, $p\_\sigma(y\_2) = 0.6$, and $p\_\sigma(y\_3) = 0.1$, corresponds to the vector of "stars and bars" (∗, ∗, ∗, |, |, ∗, ∗, ∗, ∗, ∗, ∗, |, ∗), where each '∗' amounts to 0.1 of probability mass, and there are $(n - 1)$ separators '|'. The number of ways $(n - 1)$ bars (resp., $m$ stars) can be inserted in a vector of $(m + n - 1)$ positions, with the remaining elements set to stars (resp., bars) is given by the binomial coefficient above. See https://brilliant.org/wiki/integerequations-star-and-bars/#stars-and-bars for further details.

One possible solution consists of simply giving up on *predetermining* class prevalence values, and instead letting them vary at random, by first generating a random distribution $p$ and then generating a sample $\sigma$ by randomly picking items from the population according to $p$. Somewhat unexpectedly, though, sampling distribution vectors $p$ *uniformly* at random, i.e., so that all legitimate distribution vectors are equally likely, is not a trivial task. An intuitive and straightforward procedure, consisting of drawing $n$ values uniformly at random from the $[0, 1]$ interval and then normalizing each value by the sum of all values (the sampling method used in Esuli et al. (2021)), corresponds to a sampling distribution that is strongly biased towards the centre of the distribution, for reasons that are discussed by Smith and Tromble (2004). Luckily enough, Smith and Tromble (2004) also presented a correct sampling algorithm, called the *Kraemer algorithm*, for sampling the unit $(n - 1)$-simplex<sup>7</sup> uniformly. Given a set of $n$ classes, the method works as follows:


The Kraemer sampling algorithm has two additional advantages with respect to sampling based on a predefined grid: (i) it allows the practitioner to draw a desired number of samples, instead of requiring the generation of all $K(m, n)$ valid combinations from the grid of prevalence values; and (ii) it truly allows any possible distribution vector to be picked, while this is not possible when using a grid of values, and especially so when the grid is a coarse-grained one.
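A minimal Python sketch of uniform simplex sampling is given below. It assumes the usual formulation of the Kraemer algorithm (draw $n - 1$ values uniformly at random from $[0, 1]$, sort them, add 0 and 1 as endpoints, and take the differences between consecutive values), and contrasts it with the normalisation-based procedure mentioned above. Function names and the choice of 20,000 draws are hypothetical.

```python
import numpy as np

def kraemer_sample(n_classes, size, rng):
    """Draw `size` prevalence vectors uniformly from the unit simplex."""
    cuts = np.sort(rng.random((size, n_classes - 1)), axis=1)
    edges = np.hstack([np.zeros((size, 1)), cuts, np.ones((size, 1))])
    return np.diff(edges, axis=1)   # gaps between consecutive cut points

def normalised_uniform_sample(n_classes, size, rng):
    """Biased alternative: uniform draws divided by their sum."""
    u = rng.random((size, n_classes))
    return u / u.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
uniform_vectors = kraemer_sample(3, 20000, rng)
biased_vectors = normalised_uniform_sample(3, 20000, rng)

# Fraction of vectors in which one class holds more than 90% of the mass
extreme_uniform = float(np.mean(uniform_vectors.max(axis=1) > 0.9))
extreme_biased = float(np.mean(biased_vectors.max(axis=1) > 0.9))
print(extreme_uniform, extreme_biased)
```

Vectors close to the corners of the simplex are noticeably rarer under the normalisation-based procedure, which is one visible symptom of its bias towards the centre.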

To the best of our knowledge, the first experimental setting in the quantification literature that adopts the Kraemer algorithm as the sample-generating function is the one described in Esuli et al. (2022). Since this work is very recent, the version of the APP that is generally used by the quantification community is the "grid-based" version that we have discussed above. It remains to be seen whether the version that adopts the Kraemer algorithm will gain significant acceptance in the years to come.

## *3.4.4 Should we Use the NPP or the APP?*

<sup>7</sup> A distribution vector $p = (p\_\sigma(y\_1), \ldots, p\_\sigma(y\_n))$ belongs to the unit $(n - 1)$-simplex since $p\_\sigma(y\_i) \in [0, 1]$ for all $y\_i \in \mathcal{Y}$ and since $\sum\_{y\_i \in \mathcal{Y}} p\_\sigma(y\_i) = 1$.

The APP is much more widely used than the NPP in the quantification literature, possibly due to the difficulty of finding the large enough test sets that the NPP requires. However, the two protocols have different pros and cons. One advantage of the APP (and a corresponding disadvantage of the NPP) is that it allows many test data points to be created from the same test set; it is not always the case that test sets large enough for the NPP to be adopted (such as the above-mentioned RCV1-v2) are available, so, when a smaller test set is all we have, the APP allows generating test data points almost at will. Additionally, the APP allows many situations (i.e., different training class prevalence values, different test class prevalence values, different amounts of shift, . . . ) to be simulated; in such a way, one can test the robustness of a quantification algorithm on many conditions even if the dataset itself does not naturally exhibit such conditions. However, one disadvantage of the APP is that it is not clear how realistic these different situations are; e.g., if $p(y)$ in the test set is 0.05, testing a quantifier on a sample *U* extracted from it such that $p\_U(y) = 0.95$ might be challenging but unrealistic, since these amounts of shift may be unlikely in real-world applications. The NPP, by focusing on real samples and really occurring situations, scores higher than the APP in terms of realism.

It may be worth noting that some of the problems discussed above might be solved by defining a protocol "intermediate" between the NPP and the APP, i.e., a protocol which uses prior knowledge about the distribution of "likely" prevalence vectors that one could expect to encounter in the specific domain at hand. However, we are unaware of previous experiments that used this or similar approaches, likely due to the fact that, in real scenarios, it is difficult to have any such prior knowledge about how the distribution might vary. Anyway, the bottom line is that the pros and cons of the APP and the NPP have led to some controversy around the adequacy of these protocols in the assessment of the performance of quantification systems. Hassan et al. (2021) expose the shortcomings and potential risks that the adoption of the APP (more specifically: of the grid-based variant discussed in Section 3.4.2) might bring about in the evaluation. Among other things, the authors reported that knowing in advance the expected value of the distribution vectors that the APP generates (e.g., that in binary quantification the positive class has an expected prevalence value of $\mathbb{E}[p\_\sigma(\oplus)] = 0.5$) might be maliciously exploited in order to get an (illusory) advantage over other methods that do not make such an assumption.

It should also be mentioned that the APP, as it has been used up to now, only models either covariate shift or prior probability shift, and does not model concept shift. To see this, assume we are dealing (see Section 1.5) with a "*X* → *Y*" problem, which is modelled by Equation 1.3. By subsampling the test set we are simulating covariate shift, whereby *p(y)* changes only because *p(***x***)* changes (since we have selectively removed specific data items **x**); note that *p(y*|**x***)* does not change, i.e., there is no concept shift, since the labels of the items that have not been removed have remained the same. Concept shift could be simulated not by removing labelled items but by flipping the labels on some data items, e.g., according to one of the two methods discussed in Esuli and Sebastiani (2013). With this method, given a set *U* of unlabelled items, many samples that contain the very same data items (which means there is no covariate shift) could be generated by flipping different subsets of the items contained in *U*. Conversely, assume we are dealing with a "*Y* → *X*" problem, which is modelled by Equation 1.4. By subsampling the test set we are simulating prior probability shift, whereby *p(y)* changes *motu proprio* (since we have selectively removed data items characterised by a specific label *y*); note that *p(***x**|*y)* does not change, i.e., there is no shift in the within-class densities, since the feature vectors of the items that have not been removed have remained the same.

## **3.5 Model Selection in Quantification**

The performance of many machine learning algorithms depends on how their *hyperparameters* are set. Hyperparameters control specific aspects of the learning process and, in contrast to the *parameters* of the model, they are not learned during the training phase, but are instead set in advance.

Although machine learning methods often come with default values for the hyperparameters (values that the inventors of the method have found to work reasonably well in a variety of scenarios), it is well known that the final performance can often be improved by carefully tuning the hyperparameters for the specific applicative domain. Quantification systems are no exception in this regard.

The process of hyperparameter optimisation is known as *model selection*, and typically consists of testing how well the model fares when setting the hyperparameters with different combinations of values from a set of candidate configurations. Model selection is carried out in a fully automated way, i.e., the model's performance is assessed on held-out validation data or via cross-validation.

Model selection is thus inherently related to performance evaluation. Hyperparameter optimisation should therefore mimic the evaluation protocol (using validation data) when assessing the adequacy of each of the candidate configurations. Since quantification has become a task in its own right, with dedicated evaluation measures (Sections 3.1, 3.2, and 3.3) and dedicated experimental protocols (Section 3.4), it should likewise have specific model selection routines (Moreo and Sebastiani, 2021). In other words, since the goal of model selection is to choose the configuration of hyperparameters that performs best according to a given experimental protocol and a given evaluation measure, it makes perfect sense to adopt the same evaluation protocols and error metrics customarily used in the evaluation of quantification systems.
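A minimal sketch of this idea follows, assuming a deliberately simplified setting: the only "hyperparameter" is the decision threshold of a score-based classifier used for Classify and Count, validation samples are drawn at a grid of prevalence values (in the spirit of the APP), and the measure being minimised is absolute error. All names (`select_threshold`, `draw_sample`, the synthetic scores) are hypothetical and are not taken from Moreo and Sebastiani (2021).

```python
import random

def cc_prevalence(scores, threshold):
    """Classify and Count: fraction of scores above the decision threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def draw_sample(pos, neg, prevalence, size, rng):
    """Subsample validation scores so that positives make up `prevalence`."""
    n_pos = round(prevalence * size)
    sample = rng.sample(pos, n_pos) + rng.sample(neg, size - n_pos)
    return sample, n_pos / size

def select_threshold(pos, neg, candidates, prevalences, size, repeats, rng):
    """Return the candidate with the lowest mean absolute quantification
    error over validation samples drawn at every prevalence in the grid."""
    best, best_err = None, float("inf")
    for t in candidates:
        errors = []
        for p in prevalences:
            for _ in range(repeats):
                sample, true_p = draw_sample(pos, neg, p, size, rng)
                errors.append(abs(cc_prevalence(sample, t) - true_p))
        mean_err = sum(errors) / len(errors)
        if mean_err < best_err:
            best, best_err = t, mean_err
    return best, best_err

rng = random.Random(0)
# Hypothetical held-out classifier scores for positive and negative items.
val_pos = [rng.gauss(1.0, 1.0) for _ in range(300)]
val_neg = [rng.gauss(-1.0, 1.0) for _ in range(300)]
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
grid = [i / 10 for i in range(11)]

best_t, best_mae = select_threshold(val_pos, val_neg, candidates, grid, 100, 5, rng)
```

The point of the sketch is the structure of the loop: each candidate configuration is scored with a quantification error averaged over many validation samples, rather than with a classification loss on a single validation set.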

Somewhat surprisingly, though, the quantification community has largely overlooked this aspect in the past. In a large body of quantification work, it is not even documented whether the hyperparameters were optimised at all (Esuli and Sebastiani, 2014; Forman, 2008; González et al., 2017; González-Castro et al., 2013; Hopkins and King, 2010; Levin and Roitman, 2017; Pérez-Gállego et al., 2017; Saerens et al., 2002). Other papers (Barranquero et al., 2015; Bella et al., 2010; Esuli and Sebastiani, 2015; Hassan et al., 2020; Milli et al., 2013) simply report that the hyperparameters were left to their default values; others do not document the evaluation measure being optimised during model selection (Esuli et al., 2020; Gao and Sebastiani, 2016), or instead optimise for a classification-oriented loss (Barranquero et al., 2013; Pérez-Gállego et al., 2019).

The only paper we are aware of that proposed the use of a quantification-oriented optimisation of hyperparameters is Moreo and Sebastiani (2021). In their work, the authors claimed that the "Classify and Count" (CC – Section 4.2.1) method and its variants (Sections 4.2.2, 4.2.3, 4.2.4), routinely used as baseline models in experimental evaluations, have largely been misrepresented, since they have never been optimised properly for the task of quantification. In their results they showed that, when properly optimised, these simple methods become respectable contenders, even if still inferior to the most sophisticated quantification methods.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Methods for Learning to Quantify**

This section is devoted to discussing methods that have been proposed in the literature for tackling quantification.<sup>1</sup> All of these methods rely on supervised learning, and depart from standard classification methods in one or more ways.

As in the rest of this book, our main focus is (for the reasons discussed in Section 1.4) single-label multiclass quantification. While many of the methods that will be discussed in this section can natively deal with the single-label multiclass case, some other methods (for example, those of Sections 4.2.5, 4.2.12 and 4.3.1) are only defined for the binary case, and cannot easily be extended to the single-label multiclass case. In order to use them for single-label multiclass quantification, it is thus necessary to run them in binary mode for each class in the codeframe, and to normalise the resulting class prevalence values so that they sum to 1.

Broadly speaking, two large classes of methods can be discerned in the literature.

The first class is that of *aggregative methods*, i.e., methods that require the classification of all the individual data items as an intermediate step; these methods will be the subject of Sections 4.2 and 4.3. Within the class of aggregative methods, two subclasses can be identified. The first subclass (Section 4.2) includes methods based on general-purpose learners; in these methods the classification of the individual items performed as an intermediate step may be accomplished by means of any classifier. The second subclass (Section 4.3) is instead composed of methods that, in order to classify the individual data items, rely on special-purpose learning methods devised with quantification in mind.

The second class is that of *non-aggregative* methods, i.e., methods that solve the quantification task "holistically", without classifying the individual items; these methods will be the subject of Section 4.4.

<sup>1</sup> Section 6.2 presents a list of software tools implementing quantification methods, including many of those presented in this section.

A. Esuli et al., *Learning to Quantify*, The Information Retrieval Series 47, https://doi.org/10.1007/978-3-031-20467-8\_4

We start by discussing, in the next section, a method that belongs to none of the classes above, but that is sometimes considered as a trivial baseline in comparative experiments.

## **4.1 Maximum Likelihood Prevalence Estimation**

*Maximum Likelihood Prevalence Estimation* (MLPE) is not a real quantification method, but is sometimes used (see e.g., Barranquero et al., 2013) as a trivial baseline against which genuine quantification methods are compared. MLPE makes the naïve assumption that there is zero distribution shift between *L* and *U*, and thus consists of taking *pL(y)* as an estimate of *pU (y)*, i.e.,

$$
\hat{p}_U^{\text{MLPE}}(y) = p_L(y) \tag{4.1}
$$

This is the trivial predictor for quantification, somewhat akin to always picking the majority class in classification.<sup>2</sup>
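As a small illustration of Equation 4.1, the following sketch computes the MLPE estimate from training labels alone and shows the error it incurs on an unlabelled set whose prevalence has shifted. The label names and the prevalence values are made up for the example.

```python
from collections import Counter

def mlpe(train_labels, classes):
    """Return the training prevalence of each class as the estimate for U."""
    counts = Counter(train_labels)
    return {c: counts[c] / len(train_labels) for c in classes}

# Hypothetical training set with 30% positives.
train_labels = ["pos"] * 30 + ["neg"] * 70
estimate = mlpe(train_labels, ["pos", "neg"])

# If the (unknown) prevalence of "pos" in U were 0.8, MLPE would be off by
# the full amount of the shift, since it never looks at U.
true_prevalence_in_U = 0.8
abs_error = abs(estimate["pos"] - true_prevalence_in_U)
```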

However, whether MLPE should be used as a baseline at all is questionable. In fact, on a dataset where it indeed happens that $p_U(y) = p_L(y)$, MLPE cannot be beaten by any genuine quantification method, and will hardly be beaten by any such method on datasets characterised by very low distribution shift. This should not be taken to mean that these genuine quantification methods are ineffective, but should rather indicate that we have chosen the wrong dataset(s), since there is no point in applying quantification in environments characterised by the absence of distribution shift.<sup>3</sup>

<sup>2</sup> Given a prediction task, an effectiveness measure *M* for it, and sets of labelled and unlabelled items *L* and *U* (assumed to be independently and identically distributed), the *trivial predictor* may be defined as the predictor we obtain if we attempt to maximise *M* on *U* by using only the output variables (and not the input variables) of *L*. When "vanilla" accuracy (i.e., the fraction of classification decisions that are correct) is the effectiveness measure, the classifier that always predicts the majority class is the trivial predictor for both binary and multiclass classification; under any reasonable effectiveness measure, MLPE is the trivial predictor for quantification.

<sup>3</sup> As an example, assume we are asked by a customer to set up a system that monitors, in a stream of data, the class prevalence of a certain class *y* of interest to the customer. (For instance, the data may be textual comments about a product marketed by the customer, and *y* = ⊕ may be the class of positive such comments.) Assume also that the customer provides us with a training set *L* of comments labelled according to $\mathcal{Y} = \{\oplus, \ominus\}$, where $p_L(\oplus) = k$. If the system we deliver to the customer is one that always returns $p_\sigma(\oplus) = k$, for any sample *σ* that we may sample from the stream, the customer would not be happy, even if we justify this by saying that, on our test data, this system has outperformed any other genuine quantification system we have tested.

As a side note, the assumption that $p_U(y) = p_L(y)$ has been used in the past to justify classification policies. For instance, in the binary case, Yang (2001) defines a strategy (called "Pcut") for optimising classification thresholds, which consists of picking the threshold that causes $p_U(y)$ to be equal (or as close as possible) to $p_L(y)$.

# **4.2 Aggregative Methods Based on General-Purpose Learners**

In this section and in Section 4.3 we will discuss quantification methods that have an *aggregative* nature, i.e., that first require a (hard or soft) classifier to issue a prediction for each individual item, and that then output an estimated class prevalence based on these individual predictions. Indeed quantification- and classification-related goals can be supported, to some degree, by the same training strategy (Tasche, 2021). All the methods discussed in this section can be applied on top of any supervised learning algorithm for training classifiers.

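Of the methods discussed in this section, some (which we call methods of type 1) require the classifier to be a hard classifier, i.e., one that returns a class label for each item, while others (methods of type 2) require it to be a soft classifier, i.e., one that returns a posterior probability $p(y|\mathbf{x})$ for each item and each class.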


For methods of type 2, these posterior probabilities should be well calibrated (in the sense discussed in Section 2.1). Some classifiers are known to return well calibrated probabilities (e.g., classifiers trained via logistic regression (Zadrozny and Elkan, 2002)). The posterior probabilities returned by some other classifiers are known instead to be not well calibrated (e.g., this is the case of the naïve Bayesian classifier (Domingos and Pazzani, 1997)). Yet other classifiers (e.g., those trained via SVMs or AdaBoost) do not return posterior probabilities, but generic confidence scores. In these last two cases it is possible to map the obtained posterior probabilities / confidence scores into well calibrated posterior probabilities via some calibration method (Platt, 2000; Zadrozny and Elkan, 2002).

All this basically means that any supervised learning method can be used both for methods of type 1 and for methods of type 2 above. We now discuss all of these methods in increasing order of sophistication.

## *4.2.1 Classify and Count*

An obvious method for quantification consists of training a hard classifier *h* from *L* via a standard learning algorithm, classifying the items in sample *U*, and estimating *pU (y)* by simply counting the fraction of items in *U* that are predicted to belong to class *y*. This corresponds to computing

$$\begin{split} \hat{p}_{U}^{\text{CC}}(y) &= p_{U}^{h}(\hat{y}) \\ &= \frac{|\{\mathbf{x} \in U \mid h(\mathbf{x}) = y\}|}{|U|} \end{split} \tag{4.2}$$

Forman (2008) calls this the *Classify and Count* (CC) method.

As already discussed in Section 1.2, CC is sub-optimal, because standard classifiers might be biased, i.e., generate severely unbalanced numbers of false positives and false negatives, and because they are usually tuned to minimise a measure of classification error, and not of quantification error. However, CC plays an important role in quantification research since it is always used as the trivial baseline which any reasonable quantification method must improve upon.
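Equation 4.2 amounts to a few lines of code. In the sketch below the hard classifier `h` is a hypothetical stand-in (a fixed threshold on a one-dimensional feature); in practice it would be any classifier trained on *L*.

```python
def classify_and_count(items, h, classes):
    """Estimate class prevalence as the fraction of items assigned to each class."""
    predictions = [h(x) for x in items]
    return {c: predictions.count(c) / len(predictions) for c in classes}

def h(x):
    """Hypothetical hard classifier: positive iff the feature is above 0."""
    return "pos" if x > 0 else "neg"

# Hypothetical unlabelled sample U of one-dimensional items.
U = [-2.0, -0.5, 0.3, 1.2, 2.5, -1.1, 0.8, -0.2]
estimate = classify_and_count(U, h, ["pos", "neg"])
```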

## *4.2.2 Probabilistic Classify and Count*

*Probabilistic Classify and Count* (PCC) is a variant of CC which consists of using *L* for training a probabilistic classifier $s : \mathcal{X} \rightarrow [0, 1]^{|\mathcal{Y}|}$, generating a posterior probability $p(y|\mathbf{x})$ for each item $\mathbf{x} \in U$ and for each class $y \in \mathcal{Y}$, and computing $p_U(y)$ as the *expected* fraction of items predicted to belong to *y*. If by $E[x]$ we indicate the expected value of *x*, this corresponds to computing

$$\begin{split} \hat{p}_{U}^{\text{PCC}}(y) &= E[p_{U}^{h}(\hat{y})] \\ &= \frac{1}{|U|} \sum_{\mathbf{x} \in U} p(y|\mathbf{x}) \end{split} \tag{4.3}$$

As a quantification method, PCC was first used by Bella et al. (2010), where it is called "Probability Average".<sup>4</sup> The rationale of PCC is that posterior probabilities contain richer information than classification decisions, which are usually obtained from posterior probabilities via Equation 1.2. When using the classification decisions, CC does not leverage the quantitative information encoded in the $p(y|\mathbf{x})$'s, which is discarded when using Equation 1.2, and this may be suboptimal.

<sup>4</sup> Tang et al. (2010) also use a method called "Probabilistic Classify and Count", and they also show that it outperforms CC. Their method might indeed coincide with the method discussed in this section, but the authors do not explain what their method precisely consists of.

As a quantification method, PCC was first evoked by Lewis (1995), who stated that "(...) if our goal is to count class members, and if we have estimates of the probability of class membership, we should use the estimates directly to estimate the number of class members, rather than use them to classify documents." PCC was later dismissed *a priori* (i.e., without even being tested) as unsuitable by Forman (2005, 2008), on the grounds that, when the training distribution $p_L$ and the unlabelled distribution $p_U$ are different (as they should be assumed to be in any application of quantification), probabilities calibrated on *L* (*L* being the only available set where calibration may be carried out) cannot be, by definition, calibrated for *U* at the same time (see also Section 2.1). Forman's criticism is indeed well-taken, since the assumption underlying the very notion of probability calibration is the IID assumption, whose consequence (namely, that class prevalence values are invariant across the training set and the set of unlabelled items) is at odds with the very notion of quantification.
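The difference between Equation 4.3 and Equation 4.2 can be seen on a toy example. The sketch below assumes that the posterior probabilities have already been produced by some soft classifier; the values are invented for illustration, and the hard decisions used for CC are taken by picking the most probable class.

```python
def pcc(posteriors, classes):
    """Average the posterior probabilities of each class over the sample."""
    n = len(posteriors)
    return {c: sum(p[c] for p in posteriors) / n for c in classes}

def cc_from_posteriors(posteriors, classes):
    """CC on the hard decisions obtained by picking the most probable class."""
    decisions = [max(classes, key=lambda c: p[c]) for p in posteriors]
    return {c: decisions.count(c) / len(decisions) for c in classes}

classes = ["pos", "neg"]
# Hypothetical posteriors for three unlabelled items.
posteriors = [
    {"pos": 0.9, "neg": 0.1},
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.3, "neg": 0.7},
]
pcc_estimate = pcc(posteriors, classes)
cc_estimate = cc_from_posteriors(posteriors, classes)
```

Here the second item counts as a full positive for CC but only as 0.6 of a positive for PCC, which is the "quantitative information" that CC discards.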

## *4.2.3 Adjusted Classify and Count*

*Adjusted Classify and Count* (ACC – also called "Adjusted Count" in Forman (2008) and the "Confusion Matrix Method" in Saerens et al. (2002)) requires training a hard classifier *h* from *L* via a standard learning algorithm, classifying the items in *U*, and then observing that, thanks to the law of total probability, it holds that

$$p_U^h(\hat{y}_j) = \sum_{y_i \in \mathcal{Y}} p_U^h(\hat{y}_j|y_i) \cdot p_U(y_i) \tag{4.4}$$

Here, $p_U^h(\hat{y}_j|y_i)$ represents the fraction of the data items in *U* with true class $y_i$ that classifier *h* has assigned to class $y_j$. Once the classifier has been trained and applied to *U*, the quantity $p_U^h(\hat{y}_j)$, which represents the fraction of items in *U* that have been assigned to $y_j$ by classifier *h*, can be observed, and the quantity $p_U^h(\hat{y}_j|y_i)$ can be estimated from *L* via *k*-fold cross-validation (*k*-FCV);<sup>5</sup> the quantity $p_U(y_i)$ is instead unknown, and is indeed the quantity we want to estimate. Since there are $|\mathcal{Y}|$ equations of the type described in Equation 4.4 (one for each $y_j \in \mathcal{Y}$), and since there are $|\mathcal{Y}|$ quantities of type $p_U(y_i)$ to estimate (one for each $y_i \in \mathcal{Y}$), we are in the presence of a system of $|\mathcal{Y}|$ linear equations in $|\mathcal{Y}|$ unknowns. This system can be solved via standard techniques, thus yielding the $\hat{p}_U^{\text{ACC}}(y_i)$ estimates.

<sup>5</sup> Barranquero et al. (2013) and Forman (2005, 2008) actually use *stratified k*-fold cross-validation, i.e., the training set is split in such a way as to ensure that the class distribution is invariant across the different folds. Given that our goal is quantification (i.e., that we assume the presence of distribution shift), the rationale of using stratification seems dubious here, given that we do not have any guarantee that the distribution that stratification enforces in the various folds will be the same in the test set.

In a nutshell, ACC is based on the idea of adjusting the results of CC by taking into account the propensity of the classifier to make misclassifications of a certain type. This is particularly evident in the binary case $\mathcal{Y} = \{\oplus, \ominus\}$, where Equation 4.4 comes down to

$$\begin{split} p_U^h(\hat{\oplus}) &= p_U^h(\hat{\oplus}|\oplus) \cdot p_U(\oplus) + p_U^h(\hat{\oplus}|\ominus) \cdot p_U(\ominus) \\ &= \text{TPR}_U \cdot p_U(\oplus) + \text{FPR}_U \cdot p_U(\ominus) \\ &= \text{TPR}_U \cdot p_U(\oplus) + \text{FPR}_U \cdot (1 - p_U(\oplus)) \end{split} \tag{4.5}$$

where by $\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}$ and $\text{FPR} = \frac{\text{FP}}{\text{FP}+\text{TN}}$ we indicate the true positive rate (a.k.a. "recall", or "sensitivity") and the false positive rate (i.e., 1 minus the "specificity"), resp., that the classifier has obtained. From Equation 4.5 we obtain

$$\begin{split} p_U(\oplus) &= \frac{p_U^h(\hat{\oplus}) - \text{FPR}_U}{\text{TPR}_U - \text{FPR}_U} \\ &= \frac{\hat{p}_U^{\text{CC}}(\oplus) - \text{FPR}_U}{\text{TPR}_U - \text{FPR}_U} \end{split} \tag{4.6}$$

from which, if by $\text{TPR}_L$ and $\text{FPR}_L$ we indicate the true positive rate and false positive rate, resp., that have been estimated by *k*-FCV, we derive

$$
\hat{p}_U^{\text{ACC}}(\oplus) = \frac{\hat{p}_U^{\text{CC}}(\oplus) - \text{FPR}_L}{\text{TPR}_L - \text{FPR}_L} \tag{4.7}
$$
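A worked instance of Equation 4.7 may help. The sketch below assumes that $\text{TPR}_L$ and $\text{FPR}_L$ have already been estimated (e.g., via *k*-FCV); the numeric values are invented, and the function name `acc_binary` is hypothetical.

```python
def acc_binary(cc_estimate, tpr, fpr):
    """Binary Adjusted Classify and Count (Equation 4.7)."""
    if tpr == fpr:
        raise ValueError("the adjustment is undefined when TPR equals FPR")
    return (cc_estimate - fpr) / (tpr - fpr)

tpr, fpr = 0.8, 0.1

# If the true prevalence in U were 0.3 and the rates carried over from L to U,
# Equation 4.5 says CC would observe 0.8 * 0.3 + 0.1 * 0.7 = 0.31.
cc_observed = tpr * 0.3 + fpr * (1 - 0.3)
adjusted = acc_binary(cc_observed, tpr, fpr)

# With a CC estimate below FPR the adjusted value is negative, i.e., it
# falls outside [0, 1].
negative_case = acc_binary(0.05, tpr, fpr)
```

In the first case the adjustment undoes the classifier bias and recovers 0.3 (up to floating-point rounding).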

ACC can be proved to be *Fisher-consistent* under prior probability shift (Tasche, 2017), which is a guarantee that the provided estimate $\hat{p}_U(y)$ would be correct if computed on the whole populations of interest (instead of the available samples *L* and *U* of limited cardinality), on condition that the training and unlabelled populations are linked by *prior probability shift*.

Fisher consistency is related to an estimator being unbiased, and has been proposed as a desirable property of a quantification method (Tasche, 2017). Fisher consistency does not provide any practical guarantee, given that it discounts the randomness of empirical distributions sampled in the real world. However, a quantification method lacking this property can be seen as problematic, given that, even for large sample sizes, it may end up providing poor estimates of class prevalence. Thus it can be seen as a necessary, but not sufficient, property of a good quantification method. Dataset shifts (Section 1.5) close to, but slightly deviating from, prior probability shift can cause a loss of Fisher consistency for ACC (Tasche, 2017).
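The multiclass system of Equation 4.4 can be illustrated in the idealised situation that Fisher consistency refers to, namely one in which the misclassification rates are known exactly and carry over unchanged to *U*. The sketch below is an assumption-laden toy: the 3-class rate matrix and the prevalence vector are invented, and the small Gaussian-elimination routine merely stands in for any standard linear solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the rates do not identify p_U")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

# rates[j][i] plays the role of p(assigned y_j | true y_i); columns sum to 1.
rates = [
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
true_prevalence = [0.5, 0.3, 0.2]

# What CC would observe on U under these rates (right-hand side of Eq. 4.4).
cc_observed = [sum(rates[j][i] * true_prevalence[i] for i in range(3))
               for j in range(3)]

acc_estimate = solve_linear_system(rates, cc_observed)
```

With exact rates the solution coincides with the true prevalence vector; with rates estimated from a finite *L*, or under other kinds of shift, it generally will not.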

One problem with ACC is that the $\hat{p}_U^{\text{ACC}}(y_i)$'s are not guaranteed to be in [0,1], due to the fact that the estimates of the $p_U^h(\hat{y}_j|y_i)$'s may be inaccurate, i.e., substantially different from the true $p_U^h(\hat{y}_j|y_i)$'s.<sup>6</sup> In fact, ACC is based on the hypothesis that the $p_U^h(\hat{y}_j|y_i)$'s are invariant across the training set and the set of unlabelled items, which is questionable in the presence of distribution shift.<sup>7</sup> The fact that the $\hat{p}_U^{\text{ACC}}(y_i)$'s may not be in [0,1] particularly affects classes characterised by a low or very low prevalence (which are ubiquitous in e.g., text classification): in these cases it may well be that $\hat{p}_U^{\text{CC}}(y) < \text{FPR}_L$, which means that, since in these scenarios it is usually the case that $\text{TPR}_L > \text{FPR}_L$, ACC returns a negative value.

<sup>6</sup> This problem had already been noted by Lew and Levy (1989).

This problem has led most authors (see e.g., Forman, 2008) to rely on "clipping and rescaling", i.e., to (i) "clip" the $\hat{p}_U(y_i)$ estimates (i.e., equate to 1 every value higher than 1 and to 0 every value lower than 0), and (ii) rescale them so that they sum up to 1. Clipping is a hardly justified heuristic, though, and if the values to be clipped are either much smaller than 0 or much higher than 1, it can seriously bias the results. A better alternative (that does away with clipping, but that, to the best of our knowledge, has never been discussed in the literature) might consist of giving the $\hat{p}_U(y_j)$ values obtained by solving the system of equations of type 4.4 as input to a "softmax" function

$$\sigma(x) = \frac{e^{x}}{\sum_{x_i} e^{x_i}} \tag{4.8}$$

whose effect is to monotonically map the $\hat{p}_U(y_j)$'s obtained from solving the system of linear equations to $|\mathcal{Y}|$ values in [0,1] that sum up to 1. The values returned by the softmax would then be used as the final $\hat{p}_U^{\text{ACC}}(y_j)$ class prevalence estimates in place of the values computed from the system of linear equations.
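The two post-processing options can be compared side by side. The sketch below is illustrative only: the raw values are invented, and subtracting the maximum inside `softmax` is a standard numerical-stability device that does not change the result of Equation 4.8.

```python
import math

def clip_and_rescale(values):
    """Clip each value to [0, 1], then rescale so that the values sum to 1."""
    clipped = [min(1.0, max(0.0, v)) for v in values]
    total = sum(clipped)
    if total == 0:
        raise ValueError("all values were clipped to 0; cannot rescale")
    return [v / total for v in clipped]

def softmax(values):
    """Equation 4.8, computed after subtracting the maximum for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw ACC output with one slightly negative entry.
raw = [-0.05, 0.35, 0.70]
clipped = clip_and_rescale(raw)
soft = softmax(raw)
```

Both outputs are valid distributions that preserve the ordering of the raw values; clipping sends the negative entry to exactly 0, whereas the softmax always returns strictly positive values and changes every entry, including those that were already in [0,1].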

ACC is actually very old, since its binary version goes back at least to Gart and Buck (1966), where, in an application pertaining to epidemiology, it was used in order to determine the prevalence of a given disease from the results of a screening test with known true positive rate and true negative rate (see Section 6.4 for more on this).<sup>8</sup> As a quantification method, the earliest recorded use of it is in Vucetic and Obradovic (2001).

<sup>7</sup> Esuli and Sebastiani (2015, Appendix A) show an example in which this assumption is far from being verified in actual data.

<sup>8</sup> Several past works erroneously attribute this method to Levy and Kass (1970); in reality, the latter paper does use the method, but the authors correctly attribute its paternity to Gart and Buck (1966).

## *4.2.4 Probabilistic Adjusted Classify and Count*

*Probabilistic Adjusted Classify and Count* (PACC) is a probabilistic variant of ACC, i.e., it stands to ACC as PCC stands to CC. Its underlying idea is to replace both sides of Equation 4.4 with their expected values. Equation 4.4 is thus transformed into

$$\begin{aligned} E[p_U^h(\hat{y}_j)] &= E[\sum_{y_i \in \mathcal{Y}} p_U^h(\hat{y}_j|y_i) \cdot p_U(y_i)] \\ &= \sum_{y_i \in \mathcal{Y}} E[p_U^h(\hat{y}_j|y_i) \cdot p_U(y_i)] \\ &= \sum_{y_i \in \mathcal{Y}} E[p_U^h(\hat{y}_j|y_i)] \cdot p_U(y_i) \end{aligned} \tag{4.9}$$

where the last passage is justified by the fact that *pU (yi)* is a constant, and where

$$\begin{aligned} E[p_U^h(\hat{y}_j)] &= \frac{1}{|U|} \sum_{\mathbf{x} \in U} p(y_j|\mathbf{x}) \\ &= \hat{p}_U^{\text{PCC}}(y_j) \\ E[p_U^h(\hat{y}_j|y_i)] &= \frac{1}{|U_i|} \sum_{\mathbf{x} \in U_i} p(y_j|\mathbf{x}) \end{aligned} \tag{4.10}$$

and $U_i$ indicates the set of items in *U* whose true class is $y_i$. As for ACC, once the (soft) classifier has been trained and applied to *U*, the quantity $E[p_U^h(\hat{y}_j)]$ can be observed, and the quantity $E[p_U^h(\hat{y}_j|y_i)]$ can be estimated from *L* via *k*-fold cross-validation, which means that we are again in the presence of a system of $|\mathcal{Y}|$ linear equations in $|\mathcal{Y}|$ unknowns, that we can solve by the usual techniques. In the binary case, Equation 4.9 simplifies as

$$E[p_U^h(\hat{\oplus})] = E[p_U^h(\hat{\oplus}|\oplus)] \cdot p_U(\oplus) + E[p_U^h(\hat{\oplus}|\ominus)] \cdot p_U(\ominus) \tag{4.11}$$

from which, similarly to the case of ACC (Equations 4.5 to 4.7), we can derive

$$\hat{p}_U^{\text{PACC}}(\oplus) = \frac{\hat{p}_U^{\text{PCC}}(\oplus) - E[p_L^h(\hat{\oplus}|\ominus)]}{E[p_L^h(\hat{\oplus}|\oplus)] - E[p_L^h(\hat{\oplus}|\ominus)]} \tag{4.12}$$

where $E[p_L^h(\hat{\oplus}|\oplus)]$ and $E[p_L^h(\hat{\oplus}|\ominus)]$ are the probabilistic counterparts of TPR and FPR in ACC, i.e.,

$$\begin{aligned} E[p_L^h(\hat{\oplus}|\oplus)] &= \frac{1}{|\{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus\}|} \sum_{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus} p(\oplus|\mathbf{x}) \\ E[p_L^h(\hat{\oplus}|\ominus)] &= \frac{1}{|\{\mathbf{x} \in L : \Phi(\mathbf{x}) = \ominus\}|} \sum_{\mathbf{x} \in L : \Phi(\mathbf{x}) = \ominus} p(\oplus|\mathbf{x}) \end{aligned} \tag{4.13}$$

PACC was first proposed by Bella et al. (2010). Like PCC, PACC is dismissed as unsuitable by Forman (2005, 2008), essentially for the same reasons (already mentioned in Section 4.2.2) for which he dismisses PCC.

Like ACC, PACC can also return values for $\hat{p}_U(y_i)$ that fall outside the [0,1] range. Again, clipping and rescaling has been used in the literature to deal with these cases; again, applying a softmax (as suggested in Section 4.2.3 for ACC) may prove a better idea.
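A compact sketch of the binary case (Equations 4.12 and 4.13) is given below. It assumes that the posteriors for the items in *L* were obtained on held-out folds (as *k*-FCV would provide) and that labels are encoded as 1 for ⊕ and 0 for ⊖; the posterior values themselves are invented.

```python
def mean(values):
    return sum(values) / len(values)

def pacc_binary(posteriors_u, posteriors_l, labels_l):
    """Binary PACC: PCC on U, adjusted by the mean posteriors of the positive
    class among the true positives and true negatives of L (Eqs. 4.12, 4.13)."""
    pcc = mean(posteriors_u)
    soft_tpr = mean([p for p, y in zip(posteriors_l, labels_l) if y == 1])
    soft_fpr = mean([p for p, y in zip(posteriors_l, labels_l) if y == 0])
    if soft_tpr == soft_fpr:
        raise ValueError("the adjustment is undefined when the two means coincide")
    return (pcc - soft_fpr) / (soft_tpr - soft_fpr)

# Hypothetical held-out posteriors p(pos | x) on L, with their true labels.
posteriors_l = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]
labels_l = [1, 1, 1, 0, 0, 0]

# Hypothetical posteriors on the unlabelled sample U.
posteriors_u = [0.9, 0.7, 0.2, 0.2, 0.3, 0.1, 0.8, 0.2]

pcc_estimate = mean(posteriors_u)
pacc_estimate = pacc_binary(posteriors_u, posteriors_l, labels_l)
```

With these numbers the probabilistic counterparts of TPR and FPR are 0.8 and 0.2, the PCC estimate is 0.425, and the adjusted estimate is 0.375.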

## *4.2.5 X, MAX, and Threshold@0.50*

The methods we will describe in this section are binary-only quantification methods (i.e., multiclass versions have not been discussed in the literature, and are non-obvious) proposed by Forman (2006, 2008) and arising from a critical analysis of the ACC method.

Assume a binary quantification task with classes $\mathcal{Y} = \{\oplus, \ominus\}$. Equation 4.7 is such that, in principle, $\hat{p}_U^{\text{ACC}}(\oplus)$ is undefined when $\text{TPR}_L = \text{FPR}_L$; however, this is not problematic in practice, since a classifier such that TPR = FPR is, as observed by Forman (2005), too bad to arise in real-life situations (since it is usually the case that TPR is higher, or much higher, than FPR). Forman (2008) points out that ACC is very sensitive to the decision threshold of the classifier, which may make $\hat{p}_U^{\text{ACC}}(\oplus)$ behave erratically. In particular he points out that, if ⊕ is an infrequent class and the classifier is optimised for standard accuracy, the classifier may have a tendency to almost always predict ⊖, i.e., to deliver very small values of TPR and FPR. With such small values, the denominator of Equation 4.7 may be very small and highly unstable, thus jeopardising the method. The methods discussed in this section are instances of ACC that use a threshold different from the standard one: in particular, Forman devised these methods with the goal of choosing "a threshold that admits more true positives and many more false positives, yielding worse classifier accuracy but better quantifier accuracy".

One solution proposed by Forman (2006, 2008) is to heuristically set the decision threshold in such a way that $\text{FPR}_L = 1 - \text{TPR}_L$ (this method is dubbed X) and then use Equation 4.7. The claimed rationale of this heuristic is to avoid the tails of the $\text{FPR}_L(t)$ and $1 - \text{TPR}_L(t)$ curves, where *t* is the decision threshold.

An alternative heuristic that Forman (2006, 2008) discusses is to set the decision threshold in such a way that $(\text{TPR}_L - \text{FPR}_L)$ is maximised (this is dubbed MAX). Here, the rationale is to avoid small values in the denominator of Equation 4.7, with the goal of avoiding the above-mentioned instability in the final values computed by the equation.

Yet another heuristic proposed by Forman (2006, 2008) is to set the decision threshold in such a way that $\text{TPR}_L$ is equal to 0.50 and then use Equation 4.7; this method is dubbed *Threshold*@*0.50* (T50). The reason why the author proposes this is that such a threshold tends to be good at avoiding the tail of the $1 - \text{TPR}_L(t)$ curve.

Forman (2008) argues that it is especially in highly imbalanced datasets that $(\text{TPR}_L - \text{FPR}_L)$ risks being low; the three methods introduced in this section are thus meant to be helpful especially in contexts characterised by high imbalance.

One problem that seems to affect these methods is that, while the thresholds they choose tend to have some desired properties (e.g., avoiding small values in the denominator of Equation 4.7), these properties do not seem correlated to the one property which would seem of interest here, i.e., the fact that the resulting values of $\text{TPR}_L$ and $\text{FPR}_L$ are accurate estimates of $\text{TPR}_U$ and $\text{FPR}_U$.
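The three threshold-selection heuristics can be sketched on a small set of held-out scores. Several choices here are assumptions made for the example rather than details taken from Forman (2006, 2008): candidate thresholds are the observed scores, an item is predicted positive when its score is at least the threshold, and, since the target conditions can rarely be met exactly on finite data, the threshold that comes closest to each condition is picked.

```python
def rates(scores, labels, t):
    """TPR and FPR obtained when predicting positive iff score >= t."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    tpr = sum(s >= t for s in pos) / len(pos)
    fpr = sum(s >= t for s in neg) / len(neg)
    return tpr, fpr

def pick_threshold(scores, labels, method):
    """Choose a decision threshold according to the X, MAX, or T50 heuristic."""
    def objective(t):
        tpr, fpr = rates(scores, labels, t)
        if method == "X":      # FPR as close as possible to 1 - TPR
            return abs(fpr - (1 - tpr))
        if method == "MAX":    # TPR - FPR as large as possible
            return -(tpr - fpr)
        if method == "T50":    # TPR as close as possible to 0.50
            return abs(tpr - 0.5)
        raise ValueError(f"unknown method: {method}")
    return min(sorted(set(scores)), key=objective)

# Hypothetical held-out scores: six positives followed by six negatives.
scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35,
          0.55, 0.25, 0.20, 0.15, 0.10, 0.05]
labels = [1] * 6 + [0] * 6

t_x = pick_threshold(scores, labels, "X")
t_max = pick_threshold(scores, labels, "MAX")
t_t50 = pick_threshold(scores, labels, "T50")
```

The TPR and FPR measured at the chosen threshold would then be plugged into Equation 4.7, with the CC estimate on *U* computed at that same threshold.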

## *4.2.6 Median Sweep*

An alternative, binary-only quantification method, proposed by Forman (2006, 2008), consists of computing $\hat{p}_U^{\text{ACC}}(\oplus)$ for every decision threshold that gives rise (in *k*-fold cross-validation) to different $\text{TPR}_L$ or $\text{FPR}_L$ values, and taking the median of all the resulting estimates of $\hat{p}_U^{\text{ACC}}(\oplus)$. This method is dubbed *Median Sweep* (MS), and its rationale lies in the ability of the median to avoid outliers. Following this intuition, Forman (2006, 2008) proposed another variant of this method, called MS2, which computes the median only for cases in which $\text{TPR}_L - \text{FPR}_L > 0.25$.

Again, similarly to what we said about the methods discussed in Section 4.2.5, the problem with this method is that there does not seem to be any *a priori* reason that the median of the estimates of $\hat{p}_U^{\text{ACC}}(\oplus)$ brought about by all possible decision thresholds, while probably not an outlier, is any closer to the true value of $p_U(\oplus)$ than any of the estimates generated by other "legitimate" methods such as CC or ACC.
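A sketch of MS and MS2 follows, under assumptions similar to those one would make for any threshold sweep on finite data: candidate thresholds are the distinct held-out scores, an item is predicted positive when its score is at least the threshold, and thresholds for which $\text{TPR}_L - \text{FPR}_L$ is not positive are skipped because Equation 4.7 is undefined or meaningless there. The parameter name `min_gap` and all data are hypothetical.

```python
from statistics import median

def rates(scores, labels, t):
    """TPR and FPR obtained when predicting positive iff score >= t."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(s >= t for s in pos) / len(pos),
            sum(s >= t for s in neg) / len(neg))

def median_sweep(val_scores, val_labels, test_scores, min_gap=0.0):
    """Median of the ACC estimates over all thresholds with TPR - FPR > min_gap.
    min_gap=0.0 sketches MS; min_gap=0.25 sketches MS2."""
    estimates = []
    for t in sorted(set(val_scores)):
        tpr, fpr = rates(val_scores, val_labels, t)
        if tpr - fpr <= min_gap:
            continue
        cc = sum(s >= t for s in test_scores) / len(test_scores)
        estimates.append((cc - fpr) / (tpr - fpr))
    if not estimates:
        raise ValueError("no threshold satisfies the TPR - FPR requirement")
    return median(estimates)

# Hypothetical held-out scores on L (six positives, then six negatives).
val_scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35,
              0.55, 0.25, 0.20, 0.15, 0.10, 0.05]
val_labels = [1] * 6 + [0] * 6
# Hypothetical scores on the unlabelled sample U.
test_scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05]

ms = median_sweep(val_scores, val_labels, test_scores)
ms2 = median_sweep(val_scores, val_labels, test_scores, min_gap=0.25)
```

On this toy data the individual ACC estimates range from 0 to about 0.56 across thresholds, which gives a sense of the sensitivity to the threshold that motivates taking a median.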

## *4.2.7 The Ratio Estimator*

The ACC and PACC methods (Sections 4.2.3, 4.2.4) along with their heuristic spin-offs (Sections 4.2.5 and 4.2.6), have one aspect in common: the solution they propose is based on a specialised version of a general equation, which has the form

$$\hat{p}_U^{\text{RE}}(\oplus) = \frac{E_U[g(\mathbf{x})] - E_L[g(\mathbf{x})|\Phi(\mathbf{x}) = \ominus]}{E_L[g(\mathbf{x})|\Phi(\mathbf{x}) = \oplus] - E_L[g(\mathbf{x})|\Phi(\mathbf{x}) = \ominus]} \tag{4.14}$$

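(for simplicity we here deal with the binary case) where $E_\sigma[x]$ indicates, as usual, the expected value of *x* in sample *σ*. Here $g(\mathbf{x})$ is a function of the covariates which can be specialised to obtain Equations 4.7 and 4.12: taking $g(\mathbf{x})$ to be 1 if $h(\mathbf{x}) = \oplus$ and 0 otherwise yields Equation 4.7 (ACC), while taking $g(\mathbf{x}) = p(\oplus|\mathbf{x})$ yields Equation 4.12 (PACC).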


This is a key result from Fernandes Vaz et al. (2019), where the authors dub the family of functions described by Equation 4.14 the *ratio estimator* (RE), which is shown, in its entirety, to be Fisher-consistent (a property defined in Section 4.2.3) under prior probability shift. Moreover, the authors prove a Central Limit Theorem (CLT) for RE, allowing the practitioner to approximate the mean squared error (MSE) for the prevalence estimates. Subsequently, they propose a way for selecting the function *g(***x***)* based on explicit MSE minimisation, as estimated via the CLT, i.e.,

$$\text{MSE}(\hat{p}_U(\oplus)) \simeq \frac{1}{(\hat{\mu}_\oplus - \hat{\mu}_\ominus)^2 |L|} \left( \frac{\hat{p}_U(\oplus)^2 \hat{s}_\oplus^2}{p_L(\oplus)} + \frac{\hat{p}_U(\ominus)^2 \hat{s}_\ominus^2}{p_L(\ominus)} \right) \tag{4.15}$$

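where we have defined the sample moments (with $\hat{\mu}\_{\ominus}$ and $\hat{s}\_{\ominus}^{2}$ defined analogously on the negative examples of $L$)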

$$
\hat{\mu}\_{\oplus} = \frac{1}{|\{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus\}|} \sum\_{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus} \mathbf{g}(\mathbf{x})
$$

$$
\hat{s}\_{\oplus}^{2} = \frac{1}{|\{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus\}|} \sum\_{\mathbf{x} \in L : \Phi(\mathbf{x}) = \oplus} \left(\mathbf{g}(\mathbf{x}) - \hat{\mu}\_{\oplus}\right)^{2}.
$$

Here MSE is clearly shown to decrease when the difference between $\hat{\mu}\_{\oplus}$ and $\hat{\mu}\_{\ominus}$ is high. This provides theoretical support for choosing $g(\mathbf{x})$ as the output of a classifier (a function specifically aimed at separating $\oplus$ from $\ominus$), as is the case with ACC and PACC. In general, it is worth noting that a clear characterisation of the relationship between classification and quantification performance remains under-explored in the literature, except for some recent initial results (Tasche, 2021). Additionally, the CLT proved for RE can be exploited to compute confidence intervals (CIs) for quantification estimates.<sup>9</sup> CIs for prevalence estimation are discussed in more depth in Section 5.8.
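As an illustration, Equations 4.14 and 4.15 can be sketched in a few lines of Python. This is a minimal sketch, not the implementation of Fernandes Vaz et al. (2019): the function names `ratio_estimator` and `approx_mse` are hypothetical, labels are assumed to be encoded as 1 (for $\oplus$) and 0 (for $\ominus$), and the values of $g(\mathbf{x})$ are assumed to be precomputed.

```python
from statistics import fmean


def _split_by_label(g_L, y_L):
    """Return the values of g(x) on the positive and on the negative items of L."""
    pos = [g for g, y in zip(g_L, y_L) if y == 1]
    neg = [g for g, y in zip(g_L, y_L) if y == 0]
    return pos, neg


def ratio_estimator(g_L, y_L, g_U):
    """Equation 4.14 (binary case): g_L and g_U hold g(x) for the items of L and U."""
    pos, neg = _split_by_label(g_L, y_L)
    mu_pos, mu_neg = fmean(pos), fmean(neg)
    return (fmean(g_U) - mu_neg) / (mu_pos - mu_neg)


def approx_mse(g_L, y_L, p_hat):
    """Equation 4.15, using the sample moments (1/n normalisation) defined in the text."""
    pos, neg = _split_by_label(g_L, y_L)
    mu_pos, mu_neg = fmean(pos), fmean(neg)
    s2_pos = fmean([(g - mu_pos) ** 2 for g in pos])
    s2_neg = fmean([(g - mu_neg) ** 2 for g in neg])
    n = len(g_L)
    p_L_pos, p_L_neg = len(pos) / n, len(neg) / n
    bracket = p_hat ** 2 * s2_pos / p_L_pos + (1 - p_hat) ** 2 * s2_neg / p_L_neg
    return bracket / ((mu_pos - mu_neg) ** 2 * n)


# g is here the 0/1 decision of a hard classifier, so the estimator behaves like ACC:
# on L the classifier has tpr = 0.75 and fpr = 0.25.
g_L = [1, 1, 1, 0, 0, 0, 0, 1]
y_L = [1, 1, 1, 1, 0, 0, 0, 0]
g_U = [1, 0, 1, 0]
p_hat = ratio_estimator(g_L, y_L, g_U)
mse = approx_mse(g_L, y_L, p_hat)
```

Note that the estimate is not clipped to $[0, 1]$ in this sketch; in practice one would probably want to do so, since the ratio may fall outside the unit interval when the classifier is weak.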

Finally, the authors generalise their results to two novel quantification scenarios. In one scenario, some labels from the target population *U* are available, thus providing some additional information, not available in typical quantification settings, which can be exploited via a weighted average

$$
\hat{p}\_U^{\text{AVG}}(\oplus) = w \cdot \hat{p}\_U^{\text{RE}}(\oplus) + (1 - w) \cdot \hat{p}\_U^{\text{ML}}(\oplus) \tag{4.16}
$$

between the ratio estimator $\hat{p}\_U^{\text{RE}}(\oplus)$ and a maximum likelihood estimator $\hat{p}\_U^{\text{ML}}(\oplus)$ based on available labels from population $U$ (Section 4.1), weighted according to the MSE of each estimator. This setting is related to active learning in data streams, described in Section 5.6.

<sup>9</sup> CIs are typically defined with respect to a coverage level $(1 - \alpha)$, and define a real-valued interval $[\hat{p}\_{lo}, \hat{p}\_{hi}] \subseteq [0, 1]$ around a point estimate $\hat{p}$ such that the probability (from a frequentist point of view) that the interval contains the true value $p^{\ast}$ is $(1 - \alpha)$.

In a second scenario, we are interested in an estimate of prevalence with finer granularity, dependent on a covariate of interest *z*. One example of practical interest described by this scenario is the release of an improvement for a given product. In this case one might be interested in verifying the users' reaction to the novelty by segmenting the population of user reviews according to a temporal variable *z*. A minimal extension of the ratio estimator, i.e.,

$$
\hat{p}\_U^{\text{RE}}(\oplus|z) = \frac{E\_U[\mathbf{g}(\mathbf{x})|z] - E\_L[\mathbf{g}(\mathbf{x})|\ominus]}{E\_L[\mathbf{g}(\mathbf{x})|\oplus] - E\_L[\mathbf{g}(\mathbf{x})|\ominus]} \tag{4.17}
$$

can solve this task.
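A minimal sketch of Equation 4.17 follows, assuming that the covariate $z$ is discrete (e.g., "before" and "after" a product update). The function name `ratio_estimator_by_covariate` and the 0/1 label encoding are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import fmean


def ratio_estimator_by_covariate(g_L, y_L, g_U, z_U):
    """Equation 4.17: one prevalence estimate per value of the covariate z.

    g_L, y_L: values of g(x) and true labels (1 = positive, 0 = negative) on L.
    g_U, z_U: values of g(x) and of the covariate z for the items of U.
    """
    mu_pos = fmean([g for g, y in zip(g_L, y_L) if y == 1])
    mu_neg = fmean([g for g, y in zip(g_L, y_L) if y == 0])
    # Group the unlabelled items by the value of z, so that E_U[g(x)|z]
    # can be estimated separately within each group.
    groups = defaultdict(list)
    for g, z in zip(g_U, z_U):
        groups[z].append(g)
    return {z: (fmean(gs) - mu_neg) / (mu_pos - mu_neg) for z, gs in groups.items()}


g_L = [1, 1, 1, 0, 0, 0, 0, 1]
y_L = [1, 1, 1, 1, 0, 0, 0, 0]
g_U = [1, 0, 0, 0, 1, 1, 1, 0]
z_U = ["before"] * 4 + ["after"] * 4
estimates = ratio_estimator_by_covariate(g_L, y_L, g_U, z_U)
```

Only the numerator changes with $z$; the class-conditional expectations in the denominator are estimated once on $L$.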

# *4.2.8 Mixture Models*

Forman (2005, 2008) proposed a method for quantification based on Mixture Models (MM). MM is yet another binary-only quantification method (i.e., no SLQ extension has surfaced in the literature to date). MM assumes that the cumulative distribution $F^U$ (shorthand for $F^U(s(\mathbf{x}))$) of the scores assigned to data points in $U$ is a mixture

$$F^U = p\_U(\oplus) \cdot F^U\_{\oplus} + p\_U(\ominus) \cdot F^U\_{\ominus} \tag{4.18}$$

where $F^U\_{\oplus}$ and $F^U\_{\ominus}$ are the cumulative distributions of the scores that the classifier assigns to the positive and to the negative unlabelled examples, respectively, and where $p\_U(\oplus)$ and $p\_U(\ominus) = (1 - p\_U(\oplus))$ are the parameters of this mixture. The MM method consists of estimating $F^U\_{\oplus}$ and $F^U\_{\ominus}$ via $k$-fold cross-validation on $L$, and picking as the value of $p\_U(\oplus)$ the one that generates the best fit between the observed $F^U$ and the mixture. It is worth noting that this approach may work with generic scoring functions $s(\mathbf{x})$ that are not necessarily the output of a soft classifier.

Two variants of this method, called the *Kolmogorov-Smirnov Mixture Model* (MM(KS)) and the *PP-Area Mixture Model* (MM(PP)), are actually defined by Forman (2005); they differ in terms of how the goodness of fit between the left- and the right-hand side of Equation 4.18 is estimated.

Essentially, any method for measuring this goodness of fit can be used in connection with the MM method. Another MM method is HDy, proposed by González-Castro et al. (2013). The difference between HDy and the two previously discussed methods is that in HDy the *Hellinger Distance* (HD, an instance of the class of divergences that we discussed in Section 3.1) is used to compare the two distributions, i.e.,

$$\begin{split} \hat{p}\_{U}^{\text{HDy}}(\oplus) &= \text{HDy}(f\_{\oplus}^{L}, f\_{\ominus}^{L}, f^{U}) \\ &= \underset{0 \le \alpha \le 1}{\text{arg min}} \{ \text{HD}(\alpha f\_{\oplus}^{L} + (1 - \alpha) f\_{\ominus}^{L}, f^{U}) \} \end{split} \tag{4.19}$$

where $f\_{\oplus}^{L}$ and $f\_{\ominus}^{L}$ are the probability density functions of scores (e.g., the output of a soft classifier) for the positive and negative samples of $L$, respectively, obtained via $k$-fold cross-validation on $L$, and $f^{U}$ is the distribution of scores obtained for $U$ by the classifier trained on $L$. Notice that $f\_{\oplus}^{L}$, $f\_{\ominus}^{L}$, and $f^{U}$ are empirically approximated with histograms.
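The search in Equation 4.19 can be sketched as follows. This is a simplified illustration rather than the procedure of González-Castro et al. (2013): it assumes scores in $[0, 1]$, a single fixed number of equal-width bins, and a plain grid search over $\alpha$; the names `histogram`, `hellinger`, and `hdy`, and the parameters `bins` and `steps`, are hypothetical. The Hellinger distance is written here without any normalisation constant, which does not affect the arg min.

```python
import math


def histogram(scores, bins):
    """Normalised histogram of scores in [0, 1], with `bins` equal-width bins."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1  # a score of 1.0 goes in the last bin
    return [c / len(scores) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (unnormalised form)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def hdy(scores_pos_L, scores_neg_L, scores_U, bins=10, steps=100):
    """Grid search for the alpha that minimises HD(alpha*f_pos + (1-alpha)*f_neg, f_U)."""
    f_pos = histogram(scores_pos_L, bins)
    f_neg = histogram(scores_neg_L, bins)
    f_U = histogram(scores_U, bins)
    best_alpha, best_dist = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        mixture = [alpha * a + (1 - alpha) * b for a, b in zip(f_pos, f_neg)]
        dist = hellinger(mixture, f_U)
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha


# Toy scores: U is built as an exact 30%/70% mixture of the two score distributions.
pos_L = [0.95] * 8 + [0.45] * 2
neg_L = [0.15] * 8 + [0.65] * 2
U = [0.95] * 24 + [0.45] * 6 + [0.15] * 56 + [0.65] * 14
estimate = hdy(pos_L, neg_L, U)
```

In this toy example the scores of $U$ are an exact mixture of the class-conditional score distributions of $L$, so the grid search recovers the mixing proportion; with real data the minimum distance is generally greater than zero.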

Maletzke et al. (2019) proposed the *Distribution y-Similarity* (DyS) framework, which generalises the HDy approach by considering the dissimilarity function DS as a parameter of the model. The procedure is the same as in HDy, i.e., two probability distributions are compared, but the function used to compare them may be other than HD. In this case, the authors approximate probability distributions with histograms, and test a variety of distance functions (Maletzke et al., 2019, Table 1), i.e., Squared Euclidean (SEc), Manhattan (MH), Probabilistic Symmetric (PS), Topsøe (TD), Jensen Difference (JD), Taneja (TN), Hellinger (HD), Dice (DC), Jaccard (JC), Chebyshev (CB), Inner Product (IP), Kumar-Hassebrook (HB), Cosine (CS), and Harmonic Mean (HM). The authors also propose distance functions that do not operate on distributions but directly compare the scores assigned to samples, i.e., Mixable Kolmogorov-Smirnov (MKS) and Sample Ordinal Distance (SORD).<sup>10</sup> The DyS framework is defined as

$$\begin{split} \hat{p}\_{U}^{\text{DyS}}(\oplus) &= \text{DyS}(f\_{\oplus}^{L}, f\_{\ominus}^{L}, f^{U}) \\ &= \underset{0 \le \alpha \le 1}{\text{arg min}} \{ \text{DS}(\alpha f\_{\oplus}^{L} + (1 - \alpha) f\_{\ominus}^{L}, f^{U}) \} \end{split} \tag{4.20}$$

where $f\_{\oplus}^{L}$, $f\_{\ominus}^{L}$, and $f^{U}$ are the same probability distributions that appear in Equation 4.19.

Experiments in Maletzke et al. (2019) indicate that the Topsøe distance performs better than all the other compared distance measures. The Topsøe distance is a symmetric version of the Kullback-Leibler divergence (Johnson and Sinanovic, 2001), and is defined as

$$\text{TD}(f,g) = \text{KLD}(f,m) + \text{KLD}(g,m) \tag{4.21}$$

where $m = \frac{1}{2}(f + g)$.
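Equations 4.20 and 4.21 can be illustrated with the sketch below, in which the dissimilarity function is a parameter and the Topsøe distance is its default value. The distributions are assumed to be already available as histograms (lists of bin probabilities); the names `kld`, `topsoe`, and `dys`, and the grid search with `steps` points, are assumptions of this sketch and not details taken from Maletzke et al. (2019).

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence between two histograms.

    Bins with p = 0 contribute nothing; q must be positive wherever p is positive.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def topsoe(f, g):
    """Equation 4.21: the midpoint m is positive wherever f or g is positive."""
    m = [(a + b) / 2 for a, b in zip(f, g)]
    return kld(f, m) + kld(g, m)


def dys(f_pos, f_neg, f_U, dissimilarity=topsoe, steps=100):
    """Equation 4.20 via grid search over alpha in [0, 1]."""
    def mixture(alpha):
        return [alpha * a + (1 - alpha) * b for a, b in zip(f_pos, f_neg)]

    alphas = [k / steps for k in range(steps + 1)]
    return min(alphas, key=lambda alpha: dissimilarity(mixture(alpha), f_U))


f_pos = [0.0, 0.2, 0.8]   # histogram of scores for the positives of L
f_neg = [0.8, 0.2, 0.0]   # histogram of scores for the negatives of L
f_U = [0.6, 0.2, 0.2]     # histogram of scores for U
estimate = dys(f_pos, f_neg, f_U)
```

Any other function that takes two histograms and returns a number can be passed as `dissimilarity`, which is the sense in which DyS treats the dissimilarity function as a parameter of the model.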

<sup>10</sup> For MKS and SORD, Maletzke et al. (2019) present a specific implementation of Equation 4.20 that computes the distance function from the classification scores rather than from probability distributions.

Moreira dos Reis et al. (2018b) explore the use of HDy in a *recurrent contexts scenario*, i.e., they assume that the distribution of the data may change only among a limited set of possible distributions. They assume the availability of training data for all the possible contexts $L\_{C\_i}$ for $i \in \{1, 2, \ldots, |C|\}$, each one representing a possible distribution. They propose two extensions of the HDy method, i.e.,

• *Single Most Relevant HDy* (SMR-HDy). This method applies HDy (Equation 4.19) to each context $L\_{C\_i}$, selects the context $L\_{C\_m}$ that minimises the Hellinger distance, and returns the prevalence estimate associated with that context, i.e.,

$$C\_m = \arg\min\_{i \in C} \text{HDy}(f\_{\oplus}^{L\_{C\_i}}, f\_{\ominus}^{L\_{C\_i}}, f^U) \tag{4.22}$$

$$
\hat{p}\_U^{\text{SMR-HDy}}(\oplus) = \text{HDy}(f\_{\oplus}^{L\_{C\_m}}, f\_{\ominus}^{L\_{C\_m}}, f^U) \tag{4.23}
$$

• *Crossed-Opinions HDy* (XO-HDy), in which the data in every context are split into two parts, train $Tr\_{C\_i}$ and validation $Va\_{C\_i}$, resulting in a more complex procedure for the selection of the most likely context $C\_m$. Specifically, each training set is used to learn a classifier, which is then applied to every validation set, producing $|C|^2$ classifications. The distribution $f^{L\_{ij}}$ of each such classification is computed. The distribution of $U$ is compared with all the distributions $f^{L\_{ij}}$ to find the most plausible context for the unlabelled data, i.e.,

$$C\_m = \arg\min\_{j \in C} \frac{1}{|C|} \sum\_{i \in C} \text{HD}(\alpha\_{ij} f\_{\oplus}^{L\_{ij}} + (1 - \alpha\_{ij}) f\_{\ominus}^{L\_{ij}}, f^{U\_i}) \tag{4.24}$$

where $f^{L\_{ij}}$ is the distribution of the scores obtained by a classifier trained in context $C\_i$ on the validation set from context $C\_j$, $f^{U\_i}$ is the distribution obtained by the same classifier on $U$, and $\alpha\_{ij} = \text{HDy}(f\_{\oplus}^{L\_{ij}}, f\_{\ominus}^{L\_{ij}}, f^{U\_i})$. Finally, XO-HDy returns the prevalence estimate $\hat{p}\_U^{\text{HDy},L\_{C\_m}}(\oplus)$ associated with that context. Notice that, after selecting the most likely context for $U$, the prevalence estimate can be provided by quantification methods other than HDy.

Also grounded in mixture models, the *Gain-Some-Lose-Some* (GSLS) method (Denham et al., 2021) was proposed as a means to counter the effects of dataset shift in class prevalence estimation. The authors argue that GSLS is designed to deal with forms of dataset shift other than prior probability shift.

The method assumes that the observed probability distribution $f^L(\mathbf{x})$ and the target distribution $f^U(\mathbf{x})$, hereafter shortened to $f^L$ and $f^U$, are related by means of an intermediate distribution $f^R$. This intermediate distribution is indicated with an $R$ standing for "remaining distribution", since the framework assumes that the source distribution $f^L$ can be composed as a mixture of the $f^R$ distribution and a *loss* distribution $f^-$ (i.e., that $f^R$ is *what remains* after *losing* (subtracting) a distribution $f^-$ from $f^L$) and $f^U$ can be composed as a mixture of the $f^R$ distribution and a *gain* distribution $f^+$ (i.e., the target distribution $f^U$ is obtained from $f^R$ by adding a *gain* distribution $f^+$). That is,

$$f^L = w^- f^- + (1 - w^-) f^R \tag{4.25}$$

$$f^U = w^+ f^+ + (1 - w^+) f^R \tag{4.26}$$

where $w^-$ and $w^+$ are the weights of the mixtures, i.e., the amount of loss and gain, respectively. Note that the distributions $f^-$ and $f^+$ refer to subpopulations, *lose* and *gain* respectively, and have nothing to do with the classes of a binary problem; indeed, GSLS is formulated for the general multiclass setting.

Without further assumptions, there are infinitely many ways of choosing distributions $f^-$, $f^+$, and $f^R$, and weights $w^-$ and $w^+$. GSLS thus makes some simplifications and imposes some constraints to render the problem tractable. One such simplification consists of modelling, as other methods like HDy do, the distributions as $b$-bin histograms of the outputs produced by a probabilistic classifier. To do so, and in order to avoid overfitting, the classifier is trained on separate source data, and the histogram is computed on held-out validation source data. Then, the unknowns $f^R$, $f^-$, $f^+$, $w^-$, and $w^+$ are searched for by solving an optimisation problem that attempts to minimise the degree of (gain and loss) shift. This is akin to minimising $w^- + w^+$ and constraining the bins of the histogram for $f^R$ to lie between those of $f^L$ and $f^U$.

Once all distributions and mixture weights have been fixed, GSLS uses knowledge from the source distribution to compute quantification predictions, along with their corresponding confidence intervals, for the classes of interest. GSLS computes $p\_U(\oplus)$ via maximum likelihood, for which a number of assumptions are needed. The most important one comes down to assuming that the proportion of target samples belonging to $\oplus$ follows a binomial distribution $B(|U|, p\_U(\oplus))$ scaled by $\frac{1}{|U|}$, and establishing a relation between the unknown $p\_U(\oplus)$ and the (already known) factors of the mixture model. Further assumptions on the underlying distributions allow GSLS to express this probability in terms of other parametric distributions; the complete mathematical derivation is explained in Denham et al. (2021).

## *4.2.9 Expectation Maximisation for Quantification*

All the methods discussed so far have an *inductive* nature, since the quantification model is trained exclusively on the training set. Some quantification methods proposed in the literature have instead a *transductive* nature (see e.g., Joachims, 1999), i.e., they are trained by also looking at certain characteristics of the unlabelled examples they need to issue predictions for (although not at their labels). Because of this, the model generated may fit the designated test set better than the models generated via inductive methods, but is less general, since it is especially tailored to the very set of unlabelled items used in the training phase, and may underperform when applied to different sets of unlabelled items. The *Saerens-Latinne-Decaestecker* (SLD) algorithm, proposed by Saerens et al. (2002), has a transductive component, since it applies a transductive correction to the test predictions (issued by an inductive classifier).

SLD is an instance of Expectation Maximisation (Dempster et al., 1977), the well-known iterative algorithm for finding maximum-likelihood estimates of parameters (in our case: the class prevalence values) for models that depend on unobserved variables (in our case: the class labels). Essentially, SLD (see Algorithm 1) incrementally updates (Line 10) the posterior probabilities by using the class prevalence values computed in the last step of the iteration, and updates (Line 14) the class prevalence values by using the posterior probabilities computed in the last step of the iteration, in a mutually recursive fashion.

**Algorithm 1:** The SLD algorithm (Saerens et al., 2002).

Like ACC (Section 4.2.3), SLD was proven to be *Fisher-consistent under prior probability shift* (Tasche, 2017). In the same work, the author provides a counterexample of dataset shift under which SLD loses Fisher consistency.
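The mutually recursive updates described above can be sketched as follows. This is a minimal illustration of the expectation-maximisation scheme, under the assumption that the classifier's posteriors for the items in $U$ and the class prevalence values in $L$ are given; the function name `sld`, the stopping rule, and the parameters `max_iter` and `tol` are hypothetical, and the line numbers of Algorithm 1 are not reproduced.

```python
def sld(posteriors, train_prev, max_iter=1000, tol=1e-8):
    """Expectation maximisation for class prevalence estimation.

    posteriors: one list per item of U, holding the posteriors Pr(y|x) issued by
                a classifier trained on L (one value per class).
    train_prev: class prevalence values in L (assumed strictly positive).
    Returns the estimated prevalence values in U and the updated posteriors.
    """
    n_classes = len(train_prev)
    prev = list(train_prev)                 # initial estimate: the training prevalence
    updated = [list(p) for p in posteriors]
    for _ in range(max_iter):
        # E-step: rescale the original posteriors by the ratio between the current
        # prevalence estimates and the training prevalence, then renormalise.
        updated = []
        for post in posteriors:
            scaled = [prev[c] / train_prev[c] * post[c] for c in range(n_classes)]
            total = sum(scaled)
            updated.append([s / total for s in scaled])
        # M-step: the new prevalence estimate is the mean of the updated posteriors.
        new_prev = [sum(p[c] for p in updated) / len(updated) for c in range(n_classes)]
        converged = max(abs(a - b) for a, b in zip(new_prev, prev)) < tol
        prev = new_prev
        if converged:
            break
    return prev, updated


# Ten unlabelled items that a classifier trained on balanced data deems likely positive.
prevalence, new_posteriors = sld([[0.9, 0.1]] * 10, train_prev=[0.5, 0.5])
```

A side effect of the procedure is that the posteriors themselves are recalibrated for the new priors, which is why SLD is also used for improving classification under prior probability shift.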

Alaíz-Rodríguez et al. (2011) propose an extension of SLD, based on the assumption that each class can be decomposed into several subclasses and that the change in the prevalence of the class is actually determined by the change in the prevalence of its subclasses.<sup>11</sup> The method that Alaíz-Rodríguez et al. (2011) propose consists of two main steps. The first step consists of estimating the number of subclasses and their prior probabilities. To do so, an iterative method called *Posteriori Probability Model Selection* (PPMS) (Arribas and Cid-Sueiro, 2005) is applied to $L$. PPMS applies pruning, splitting, and merging criteria to dynamically choose the optimal number of subclasses of each class during training. The output is not only the number of subclasses per class, but also their prior probabilities and the posterior probabilities of each item, as computed by a two-layered feedforward network called *Generalised Softmax Perceptron* (GSP) (Guerrero-Curieses et al., 2005). The second step applies an extension of SLD that jointly adjusts for the class and subclass probabilities.

Re-estimating class prevalence values at subclass level was empirically shown to yield improved results when compared to SLD in the experiments of Alaíz-Rodríguez et al. (2011). One interesting experiment showed that (artificially) introducing concept shift at subclass level (within classes), yet leaving the prior probabilities unchanged at the class level, might cause the application of SLD (at class level) to be detrimental with respect to not performing re-estimation at all.

## *4.2.10 Class Distribution Estimation*

Xue and Weiss (2009) propose a similar procedure dubbed *Class Distribution Estimation Iterate* (CDE-Iterate). This procedure is primarily aimed at improving classification accuracy, while the improvement of quantification estimates (i.e., class priors) is considered by the authors as an accessory step toward the main goal.

CDE-Iterate employs *cost-sensitive learning*, by assigning different values to the cost of false negatives ($c\_{\rm FN}$) and false positives ($c\_{\rm FP}$). The ratio between the two costs is kept proportional to a value associated with the shift in prior probabilities, by enforcing

$$c\_{\rm FP} = \frac{p\_L(\oplus)}{p\_L(\ominus)} \frac{\hat{p}\_U(\ominus)}{\hat{p}\_U(\oplus)} c\_{\rm FN} \tag{4.27}$$

<sup>11</sup> The problem setting they address is single-label, both at class and subclass levels (i.e., data items labelled as *yi* belong to strictly one of the *j* subclasses of *yi*).

This shift in prior probabilities is defined as the ratio between the positive-to-negative rate in $L$ and the positive-to-negative rate in $U$; the former is readily available, whereas the latter is estimated via Classify and Count, which is also the method used to determine the quantification estimates.

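The algorithm, iterative and transductive in nature, starts by training a cost-sensitive hard classifier $h$ on $L$, using $c\_{\rm FP} = c\_{\rm FN} = 1$. The value of $c\_{\rm FN}$ does not change throughout the execution of the method. The main iteration of the algorithm consists of the following steps:

1. Classify the items in $U$ via $h$ and estimate $\hat{p}\_U(\oplus)$ via Classify and Count.
2. Update $c\_{\rm FP}$ according to Equation 4.27.
3. Retrain $h$ on $L$ using the updated costs.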


A key disadvantage with respect to SLD is the need to retrain a cost-sensitive classifier $h$ at each iteration, slightly compensated by the possibility of wrapping CDE-Iterate around hard classifiers, without requiring $h$ to output posterior probabilities. A further disadvantage is the lack of Fisher consistency under prior probability shift.
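The loop built around Equation 4.27 can be sketched as follows. This is a hedged illustration and not the implementation of Xue and Weiss (2009): the function `cde_iterate`, the fixed number of iterations `n_iter`, the clamping constant `eps`, and the choice of returning the Classify and Count estimate of the last classifier are all assumptions. The cost-sensitive learner `make_threshold_trainer` is a purely hypothetical toy (a one-dimensional threshold classifier) that stands in for any hard, cost-sensitive learner.

```python
def make_threshold_trainer(X_L, y_L):
    """Toy cost-sensitive learner on 1-D data (hypothetical, for illustration only).

    Returns a function train(c_fp, c_fn) that picks the threshold minimising the
    total misclassification cost on L and returns the resulting hard classifier.
    """
    def train(c_fp, c_fn):
        def cost(t):
            fp = sum(1 for x, y in zip(X_L, y_L) if x >= t and y == 0)
            fn = sum(1 for x, y in zip(X_L, y_L) if x < t and y == 1)
            return c_fp * fp + c_fn * fn

        threshold = min(sorted(set(X_L)) + [float("inf")], key=cost)
        return lambda x: 1 if x >= threshold else 0

    return train


def classify_and_count(h, X_U):
    """Fraction of the items of U that the hard classifier h labels as positive."""
    return sum(h(x) for x in X_U) / len(X_U)


def cde_iterate(train, X_U, p_L_pos, n_iter=5, eps=1e-6):
    """Iteratively re-estimate prevalence in U and retrain with updated costs."""
    c_fn = 1.0                      # kept fixed throughout
    h = train(1.0, c_fn)            # initial classifier: equal costs
    for _ in range(n_iter):
        p_U_pos = classify_and_count(h, X_U)
        p_U_pos = min(max(p_U_pos, eps), 1 - eps)   # avoid division by zero
        # Equation 4.27: cost of false positives as a function of the estimated shift.
        c_fp = (p_L_pos / (1 - p_L_pos)) * ((1 - p_U_pos) / p_U_pos) * c_fn
        h = train(c_fp, c_fn)
    return classify_and_count(h, X_U)


X_L = [1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
y_L = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_U = [1, 1, 2, 2, 3, 3, 4, 5, 2, 7]
estimate = cde_iterate(make_threshold_trainer(X_L, y_L), X_U, p_L_pos=0.5)
```

In this toy run the first classifier labels 30% of $U$ as positive; since this is lower than the training prevalence, false positives become more expensive, the retrained classifier is more conservative, and the estimate drops in the following iterations.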

## *4.2.11 Ensemble Methods for Quantification*

Attempts have been made at characterising the applicability of ensemble techniques to problems of binary quantification. This paradigm was proposed in early quantification work (Forman, 2006, 2008), which focuses on the choice of an optimal threshold for a classifier that would allow a good estimate for true positive rate and false positive rate. An approach dubbed *Median Sweep* (Section 4.2.6) is proposed, which considers different classifier thresholds each yielding a different estimate of class prevalence via Adjusted Classify and Count. The estimates are aggregated by computing their median, which is regarded as the final quantification result.

In more recent years, a line of work solely focused on ensembles has emerged in the quantification literature (Pérez-Gállego et al., 2017, 2019). The key idea of this paradigm is to train multiple quantifiers while introducing diversity in the skew level of the training set employed for each model. At testing time, the outputs from each model (or a subset thereof) are suitably aggregated into a single prevalence estimate. In its simplest form, the algorithm can be conceptually divided into 3 steps:

1. Sample generation: several training samples are drawn from $L$, each characterised by a different class prevalence.
2. Model training: a quantifier is trained on each of the samples generated in Step 1.


3. Output aggregation: the final estimate for class prevalence of the unlabelled set *U* is computed as the arithmetic mean of the outputs from all models from the ensemble.

Pérez-Gállego et al. (2019) expand on the above work by considering different combination techniques for output aggregation (Step 3). The strategy employed consists of discarding half of the learnt models, thereafter averaging the output of the remaining ones. If model selection is carried out considering only labelled data, the resulting procedure is dubbed *static*, as opposed to a *dynamic* approach whereby the choice also takes into account the unlabelled dataset *U*. Two *static* methods are proposed for the selection of the strongest models:


Two more criteria are proposed that embrace a *dynamic* approach:


After ranking models based on *static* or *dynamic* criteria, the top half is selected and the respective estimates are averaged, thus yielding the final estimate of class prevalence.
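The selection-and-averaging step can be sketched as follows, under the assumption that each model of the ensemble has already produced a prevalence estimate for $U$ and has been assigned a ranking score by one of the (static or dynamic) criteria. The function name `aggregate_top_half` and the convention that a higher score means a better model are assumptions of this sketch.

```python
from statistics import fmean


def aggregate_top_half(estimates, ranking_scores):
    """Average the prevalence estimates of the top-ranked half of the models.

    estimates:      prevalence estimates for U, one per model of the ensemble.
    ranking_scores: one score per model; higher scores denote better models.
    """
    order = sorted(range(len(estimates)), key=lambda i: ranking_scores[i], reverse=True)
    keep = order[: max(1, len(order) // 2)]   # always keep at least one model
    return fmean([estimates[i] for i in keep])


estimates = [0.2, 0.4, 0.6, 0.8]
ranking_scores = [0.9, 0.1, 0.8, 0.3]
final_estimate = aggregate_top_half(estimates, ranking_scores)  # averages models 0 and 2
```

Whether the procedure is static or dynamic depends only on how `ranking_scores` is computed, i.e., on whether the unlabelled set $U$ is consulted when scoring the models.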

## *4.2.12 QuaNet*

A first attempt at using a deep neural network for text quantification is presented in Esuli et al. (2018). The network takes as input the classification scores that a document classifier has produced for each document in the sample to be quantified, a document embedding for each document in the sample, and the output of an ensemble of quantification methods.

The QuaNet neural network is composed of two main parts (see Figure 4.1). Given a set of documents on which to perform the quantification, a first part of the network takes as input the sequence of document embeddings sorted by the classification score assigned by a document classifier. This part of the network is composed of an LSTM that, by observing how the content of the documents varies (as represented in the document embeddings) in relation to the classification scores, learns to output a "quantification embedding" which captures the composition of the whole set of documents to be quantified.

**Fig. 4.1** Architecture of the QuaNet network, from Esuli et al. (2018).

The second part of the network takes as input this quantification embedding vector as well as the estimated prevalence values from the ensemble of quantification methods, i.e., those described by Equations 4.2, 4.3, 4.4, 4.9 (i.e., $\hat{p}\_U^{\text{CC}}(\oplus)$, $\hat{p}\_U^{\text{PCC}}(\oplus)$, $p\_U^{h}(\hat{\oplus}\_i)$, $E[p\_U^{h}(\hat{\oplus}\_i)]$), and other statistics on the underlying classifier (the TPR, FPR, TNR and FNR estimates, see Section 4.2.3). All these values are processed through a set of fully connected layers that output the quantification prediction. Given a training set of labelled documents, the training examples for QuaNet are samples drawn from the training set so as to cover the full range of possible prevalence values.

The rationale behind QuaNet is that the network learns to select, combine, and correct the information coming from a committee of quantification methods, as a function of an abstract representation of the content of the set of documents to be quantified. The aim is to produce more accurate quantification across the whole spectrum of possible prevalence values, whereas each single method may be more accurate on some ranges than on others. The QuaNet network can in principle work with any classifier, and the embeddings can also have different origins and forms, e.g., they can be either traditional sparse bag-of-words representations, or dense representations produced by a language model or obtained as the by-product of a classification neural network (as done in Esuli et al. (2018)). Similarly, the committee of quantification methods given as input can be varied, with experiments from the original authors confirming the intuition that the richer the committee is, the better the results are.

## **4.3 Aggregative Methods Based on Special-Purpose Learners**

To date, most proposed methods explicitly addressed to quantification (Barranquero et al., 2013; Bella et al., 2010; Forman, 2005, 2006, 2008; Forman et al., 2006; Hopkins and King, 2010; Xue and Weiss, 2009) employ general-purpose supervised learning methods, i.e., address quantification by elaborating on the results returned by a general-purpose classifier. A different stance is taken by the works in this section, which propose the use of learning algorithms explicitly designed with quantification in mind. Said methods propose special optimisation criteria, which are devised to bring about good quantification performance under simple aggregation (i.e., Classify and Count). Tasche (2016) provides an interesting theoretical analysis on this approach, which he dubs *quantification without adjustment*, highlighting some inherent limitations.

## *4.3.1 Methods Based on Explicit Loss Minimisation*

In a position paper, Esuli and Sebastiani (2010b) suggest the use of an *explicit loss minimisation* approach to quantification, based on the idea of using a learning algorithm that is "aware" of the measure (a.k.a. "loss") used for evaluating quantification error, i.e., a learning algorithm that explicitly minimises that measure, whichever it may be. This is an implicit answer to the methods discussed in Section 4.2.5, which all attempt to address the undesired side effects of using learning algorithms that minimise Hamming loss (i.e., "vanilla" classification error), or proxies thereof.

The idea of using classifier-training algorithms capable of directly minimising the measure used for evaluating error is well-established in supervised learning. However, in the case of quantification, following this route is non-trivial, because the functions used for evaluating quantification (see Section 3) are inherently *nonlinear*, i.e., are such that the error on the set of unlabelled items may not be formulated as a linear combination of the error incurred by each unlabelled example. The reason for this inherent nonlinearity is that how the error on an individual unlabelled item affects the error on the set of unlabelled items depends on how the other unlabelled items have been classified. For instance, if in the other unlabelled items there are more false positives than false negatives, an additional false negative is actually *beneficial* to overall quantification error, because of the mutual compensation effect between FP and FN mentioned in Section 1.2. As a result, a measure of quantification error is inherently nonlinear, and should thus be *multivariate*, i.e., take into consideration all the unlabelled items at once.
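The compensation effect can be checked with a small numerical example. The sketch below computes the absolute error of a Classify and Count estimate directly from the cells of a contingency table; the function name `absolute_error` and the counts used are illustrative assumptions.

```python
def absolute_error(tp, fp, fn, tn):
    """Absolute error between true and estimated (Classify and Count) prevalence.

    The result equals |FP - FN| / n, which is why the errors of the two types
    compensate each other.
    """
    n = tp + fp + fn + tn
    true_prevalence = (tp + fn) / n
    estimated_prevalence = (tp + fp) / n
    return abs(estimated_prevalence - true_prevalence)


# 100 unlabelled items with more false positives (10) than false negatives (4).
before = absolute_error(tp=30, fp=10, fn=4, tn=56)
# One more positive item is misclassified: a true positive becomes a false negative.
after = absolute_error(tp=29, fp=10, fn=5, tn=56)
# Classification accuracy has decreased, yet the quantification error has decreased too.
```

Here the same individual mistake (one more false negative) would instead increase the quantification error if false negatives already outnumbered false positives, which is exactly why the error cannot be written as a sum of per-item terms.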

The assumption that the error on the set of unlabelled items may be formulated as a linear combination of the error incurred by each unlabelled example (as indeed happens for many common error measures – e.g., Hamming distance) underlies most existing learners, which are thus suboptimal for tackling quantification. In order to sidestep this problem, Esuli and Sebastiani (2010b) suggest the use of the *SVM for Multivariate Performance Measures* (SVMperf) learning algorithm proposed by Joachims (2005). SVMperf is a "structured output" learning algorithm of the Support Vector Machine family that can generate classifiers optimised for any nonlinear, multivariate loss function that can be computed from a contingency table (as all the measures discussed in Section 3 are). Esuli and Sebastiani (2014, 2015) implement and test the idea, adopting KLD or NKLD as the loss function to be minimised; the SVM(KLD) and SVM(NKLD) methods consist of adopting plain Classify and Count using a classifier generated by SVMperf as instantiated with the KLD or NKLD loss measures.

Barranquero et al. (2015) follow a very similar route, but instead of minimising a "pure" quantification loss they minimise (also via SVMperf) the *Q* measure, a combination of a classification loss function *M*<sup>c</sup> and a quantification loss function *M*<sup>q</sup> obtained (by mimicking the *Fβ* measure (van Rijsbergen, 1979)) as the harmonic mean between *M*<sup>c</sup> and *M*q, i.e.,

$$\mathcal{Q}^{M^{\text{c}},M^{\text{q}}}\_{\beta} = (1 + \beta^2) \frac{M^{\text{c}} \cdot M^{\text{q}}}{\beta^2 \cdot M^{\text{c}} + M^{\text{q}}} \tag{4.28}$$

The rationale of minimising the *Q* measure is that, by doing so, the authors attempt to learn a good quantifier that is also a good classifier, the underlying idea being that a system that delivers good quantification accuracy but bad classification accuracy is not a trustworthy quantifier. (This idea will be discussed again in Section 4.3.2). In their experiments, Barranquero et al. (2015) use (1−recall) and NAE as the classification and quantification loss functions *M*<sup>c</sup> and *M*<sup>q</sup> in Equation 4.28.
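For concreteness, Equation 4.28 can be written as a small function. This is only a transcription of the formula; the name `q_measure` and the convention of returning 0 when both inputs are 0 are assumptions of this sketch.

```python
def q_measure(m_c, m_q, beta=1.0):
    """Equation 4.28: combination of a classification measure m_c and a
    quantification measure m_q, patterned after the F-beta measure."""
    denominator = beta ** 2 * m_c + m_q
    if denominator == 0:
        return 0.0          # assumed convention for the degenerate case m_c = m_q = 0
    return (1 + beta ** 2) * m_c * m_q / denominator


balanced = q_measure(0.5, 0.5)     # equal inputs are returned unchanged
unbalanced = q_measure(0.2, 0.8)   # lower than the arithmetic mean of the inputs
```

As with any harmonic-mean-style combination, the result is pulled towards the smaller of the two inputs, so neither component can be traded off freely against the other.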

As an alternative to the use of SVMperf for minimising quantification loss measures, Kar et al. (2016) propose to tackle the explicit minimisation of KLD and other quantification loss measures via an online stochastic optimisation algorithm (*NEsted priMal-dual StochastIc updateS* – NEMSIS) that they devise. NEMSIS is an algorithm for online stochastic optimisation of nested concave functions, i.e., concave functions of functions that are themselves concave; Kar et al. (2016) show that −KLD (the negation of KLD) is indeed nested concave, which means that it lends itself to optimisation by means of NEMSIS. Via a similar process, the authors propose online stochastic optimisation algorithms that can deal with several other evaluation measures for quantification, including Barranquero et al.'s (2015) *Q* measure and variants thereof. Following the same line of research, Sanyal et al. (2018) present a family of algorithms that can directly train deep neural networks, and other methods that generate nonlinear classifiers, to optimise quantification loss functions such as KLD.

## *4.3.2 Quantification Trees and Quantification Forests*

Another work that proposes the use of learning technology specially designed for quantification is the one by Milli et al. (2013). This work customises decision trees to deal with quantification, thereby yielding what the authors call *quantification trees*. Like all decision trees (see e.g., Duda et al., 2001, §8.2 for an introduction), quantification trees are built by recursively selecting the best feature for splitting the training data, until a stopping condition is verified. Essentially, a quantification tree is a kind of decision tree in which both (a) the splitting criterion and (b) the stopping condition are informed by measures of quantification accuracy. The authors propose two methods for training a quantification tree, which differ in terms of how step (a) is tackled.

In the first method (called *classification error balancing*), the |FP−FN| measure (a proxy of absolute error) is used for evaluating the quality of a split.<sup>12</sup> For instance, if nodes in the tree check for the presence or the absence of a feature in the unlabelled item (as in binary decision trees), during the training phase the chosen feature on which to split is the one that minimises the absolute difference between the number of false positives and the number of false negatives resulting from the split.

In the second method (called *classification-quantification balancing*), the quality of a split is evaluated by the function

$$\text{MOM}(p,\hat{p}) = |\text{FP}^2 - \text{FN}^2| = (\text{FP} + \text{FN})(|\text{FP} - \text{FN}|) \tag{4.29}$$

The rationale of MOM (which stands for *multi-objective measure*) is that $(\text{FP}+\text{FN})$ is a measure of *classification* error, while |FP − FN| is a measure of *quantification* error, which means that by minimising their product one attempts to generate low values of both quantities at the same time. The underlying intuition is that it is difficult to trust a quantifier if it is not also a good classifier, and by attempting to simultaneously maximise both classification and quantification accuracy we thus strive to obtain good and *trustworthy* quantifiers. (Such an attempt is also the rationale of Barranquero et al. (2015), a work that we have discussed in Section 4.3.1.) The reader may have noticed that we had not mentioned MOM in Section 3.1, i.e., when discussing evaluation measures for quantification. The reason is that MOM, while a reasonable measure for a learner to optimise, is not a reasonable measure for evaluating the results of a quantifier, because it does not evaluate quantification error but a combination of quantification error and classification error.
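The difference between the two splitting criteria can be seen on a few hypothetical candidate splits. The sketch below only evaluates the two measures on given FP and FN counts; the function names and the counts are illustrative assumptions, and no tree-growing logic is included.

```python
def classification_error_balancing(fp, fn):
    """|FP - FN|: the split-quality measure of the first method."""
    return abs(fp - fn)


def mom(fp, fn):
    """Equation 4.29: |FP^2 - FN^2| = (FP + FN) * |FP - FN|."""
    return abs(fp ** 2 - fn ** 2)


# Hypothetical candidate splits, described by the (FP, FN) counts they induce.
candidate_splits = {"A": (6, 4), "B": (2, 1), "C": (21, 20)}

# Under |FP - FN|, splits B and C are equally good (both score 1) ...
by_balance = {name: classification_error_balancing(*c) for name, c in candidate_splits.items()}
# ... whereas MOM prefers B, which also makes far fewer classification errors.
by_mom = {name: mom(*c) for name, c in candidate_splits.items()}
best_by_mom = min(by_mom, key=by_mom.get)
```

Note that both measures are zero whenever FP = FN, so MOM acts as a tie-breaker in favour of more accurate classifiers only among splits whose errors are not perfectly balanced.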

Concerning step (b), Milli et al. (2013) stop growing the tree when no possible split would bring about an improvement in the chosen measure of quality (a measure which differs, of course, depending on which of the two methods above is used). Milli et al. (2013) take this approach further by proposing *Quantification Forests* (QFs). Essentially, a quantification forest is a "decision forest" (also known as "random forest"; see Criminisi et al. (2011) for an introduction) of quantification trees: a set of quantification trees is generated (each by restricting the training set to *k*<sub>1</sub> randomly chosen training documents and *k*<sub>2</sub> randomly chosen features), and the average of the prevalence estimates for class ⊕ is chosen as the final prevalence estimate *p̂*(⊕).

Note that quantification trees and quantification forests can be used either in their pure form (i.e., using a CC-style method as in Section 4.2.1) or, as Milli et al. (2013) indeed do, by applying the ACC-style correction of Section 4.2.3 to the class prevalence estimates they generate.

<sup>12</sup> The authors also mention the possibility of directly using KLD as a loss (i.e., as a measure of quality of the split), but do not present experiments on this.

# **4.4 Non-Aggregative Methods**

So far, we have discussed methods that work by aggregating the individual decisions that a (hard or soft) classifier takes for each and every unlabelled item, and possibly performing some post-processing. However, this is not the only possible route to quantification: one can conceive of systems that estimate class prevalence values without generating binary decisions or posterior probabilities for the individual items as an intermediate step. Indeed, this route has a theoretical justification in the so-called *Vapnik's principle*, which states (Vapnik, 1998)

"If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem."

This principle is directly applicable to quantification, since classification is a more general problem than quantification: in fact, if we have a (hard or soft) classifier we also have a quantifier, since in order to estimate class prevalence values we only need to apply (Probabilistic or non-) Classify and Count, but if we have a quantifier (i.e., an estimator of class prevalence values) this does not mean that we have a classifier. Vapnik's principle suggests that the information in the training set might be sufficient for solving quantification directly but not for solving it *indirectly*, i.e., for training a classifier that classifies the individual documents as an intermediate step. Non-aggregative quantifiers are thus the ones that more closely follow the spirit of Vapnik's principle; this section is devoted to such non-aggregative methods for learning to quantify.

## *4.4.1 The* **README** *Method*

The method proposed by King and Lu (2008), later named README and popularised by Hopkins and King (2010), is a text quantification method based on the idea of estimating class prevalence values directly via equation

$$p(\mathbf{x}\_{i}) = \sum\_{y\_{j} \in \mathcal{Y}} p(\mathbf{x}\_{i}|y\_{j}) p(y\_{j}) \tag{4.30}$$

where *p*(**x**<sub>*i*</sub>) represents the probability that a document drawn at random from *U* has **x**<sub>*i*</sub> as its vectorial representation. The problem is framed, using matrix notation, as

$$p\_U(\mathcal{X}) = p\_U(\mathcal{X}|\mathcal{Y})p\_U(\mathcal{Y}) \tag{4.31}$$

where *p<sub>U</sub>*(*X*) is a 2<sup>*K*</sup> × 1 vector whose elements are the probabilities of each possible variate (binary vector) of *K* features, *p<sub>U</sub>*(*X*|*Y*) is a 2<sup>*K*</sup> × |*Y*| matrix whose *j*-th column contains the class-conditional probabilities of all possible variates, and *p<sub>U</sub>*(*Y*) is the |*Y*| × 1 class prevalence array of interest. The solution to this equation can be obtained by standard constrained least squares as

$$
\hat{p}\_U(\mathcal{Y}) = (p\_U(\mathcal{X}|\mathcal{Y})^\top p\_U(\mathcal{X}|\mathcal{Y}))^{-1} p\_U(\mathcal{X}|\mathcal{Y})^\top p\_U(\mathcal{X})\tag{4.32}
$$

and then replacing *p<sub>U</sub>*(*X*|*Y*) with *p<sub>L</sub>*(*X*|*Y*) under the assumption that the class-conditional probabilities *p*(**x**<sub>*i*</sub>|*y<sub>j</sub>*) remain invariant between the training and unlabelled data.<sup>13</sup>

Of course, the problem is that in high-dimensional spaces (such as in the standard "bag-of-words" representation used in text-related applications), the dimension 2<sup>*K*</sup> affecting *p<sub>L</sub>*(*X*|*Y*) and *p<sub>U</sub>*(*X*) rapidly explodes, causing the method to become computationally intractable. To solve this issue, Hopkins and King (2010) applied bagging, i.e., repeatedly taking random subsets of features ("between approximately 5 and 25 words" long) and estimating *p*(*y<sub>j</sub>*) as the average of several runs. They also applied bootstrapping to re-sample matrix rows in order to estimate the variance of the method and thus derive confidence intervals for the estimates.
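The combination of Equation 4.32 with feature bagging can be illustrated as follows. This is a minimal sketch under stated assumptions, not the README implementation: the function name `readme_estimate`, the default subset size and number of subsets, the synthetic data, and the final clip-and-renormalise step (used here in place of a properly constrained solver) are all choices made for illustration.

```python
import numpy as np


def readme_estimate(X_L, y_L, X_U, n_classes, subset_size=3, n_subsets=50, seed=0):
    """Average least-squares prevalence estimates over random feature subsets."""
    rng = np.random.default_rng(seed)
    n_profiles = 2 ** subset_size
    weights = 2 ** np.arange(subset_size)  # encodes a binary profile as an integer
    estimates = []
    for _ in range(n_subsets):
        feats = rng.choice(X_L.shape[1], size=subset_size, replace=False)
        prof_L = X_L[:, feats] @ weights
        prof_U = X_U[:, feats] @ weights
        # p_U(X): relative frequency of each binary profile in U
        p_x = np.bincount(prof_U, minlength=n_profiles) / len(prof_U)
        # p_L(X|Y): one column of profile frequencies per class
        p_x_given_y = np.column_stack([
            np.bincount(prof_L[y_L == c], minlength=n_profiles) / np.sum(y_L == c)
            for c in range(n_classes)
        ])
        # Least-squares solution in the spirit of Eq. 4.32
        p_y, *_ = np.linalg.lstsq(p_x_given_y, p_x, rcond=None)
        # Assumption: enforce the constraints by clipping and renormalising
        p_y = np.clip(p_y, 0.0, None)
        if p_y.sum() > 0:
            estimates.append(p_y / p_y.sum())
    return np.mean(estimates, axis=0)


# Synthetic binary bag-of-words data (hypothetical): 8 features,
# training prevalence 0.5, unlabelled prevalence 0.8.
rng = np.random.default_rng(42)

def sample(n_pos, n_neg, K=8):
    X = np.vstack([rng.random((n_neg, K)) < 0.2, rng.random((n_pos, K)) < 0.7])
    y = np.array([0] * n_neg + [1] * n_pos)
    return X.astype(int), y

X_L, y_L = sample(1000, 1000)
X_U, _ = sample(1600, 400)
p_hat = readme_estimate(X_L, y_L, X_U, n_classes=2)
print(p_hat)
```

Each subset only requires a 2<sup>3</sup> × 2 matrix here, which is what keeps the computation tractable; the price is the pair of hyperparameters `subset_size` and `n_subsets`.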

Hopkins and King (2010) perform a small-scale experimentation (using 4 datasets, with sizes ranging from 462 to 4303 documents) in which their method is shown to outperform four baselines, each consisting of the CC method as applied to an SVM with a different kernel (linear, radial, polynomial, sigmoid). However, no details are given as to the number of subsets of the vectorial representation and size of these subsets used in these experiments. One drawback of this method is that it depends on several hyperparameters, e.g., the number of subsets of the vectorial representation, and the size of these subsets. Finding optimal values for these hyperparameters may thus require extensive cross-validation.

# *4.4.2 The iSA Method*

Although README (Hopkins and King, 2010) already counters some computational issues present in the original version by King and Lu (2008), it still demands a considerable amount of computational power, mainly due to the application of bagging. Ceron et al. (2016) proposed a variant called iSA (standing for *integrated Sentiment Analysis*) that gets rid of the bagging approach by first applying a series of transformations to the data instances and then directly solving (once) Equation 4.32. The main transformation consists of artificially augmenting the number of instances by replacing each original data item with simpler versions of it. Concretely, iSA replaces each *n*-dimensional document representation (within a bag-of-words model) with its *b* = *n*/*l* (non-overlapping) chunks of length *l* (with *l* a parameter to be specified by the user).<sup>14</sup>

<sup>13</sup> King and Lu (2008) argued that this assumption holds whenever the "data generation process" is of type *Y* → *X*, that is, when the class variable conditions the distribution of the variates in the feature space *X*. While this might hold true in their applicative scenario (verbal autopsies), where the causes of death might determine the symptoms, said assumption might not hold in general, nor be easily verifiable in practice. All other things being equal, and for reasons discussed in Section 1.5, we prefer not to stick to any dichotomy of quantification methods built on top of beliefs about data causality (thus embracing *data generation* considerations, *temporal* dependencies, or *intrinsic/extrinsic* judgements about labels).
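The chunking transformation used by iSA can be sketched as below. This is only an illustration: representing a document as a string of 0/1 characters, using consecutive letters as the positional character, and the function name `isa_chunks` are all assumptions, not details taken from Ceron et al. (2016).

```python
def isa_chunks(doc_bits: str, l: int) -> list[str]:
    """Split a binary document vector into b = n / l non-overlapping chunks.

    Each chunk is prefixed with a character encoding its position in the
    original sequence, so that its final length is l + 1.
    """
    if len(doc_bits) % l != 0:
        raise ValueError("this sketch assumes that l divides the vector length n")
    chunks = []
    for position, start in enumerate(range(0, len(doc_bits), l)):
        marker = chr(ord("a") + position)  # hypothetical positional character
        chunks.append(marker + doc_bits[start:start + l])
    return chunks


# An 8-dimensional binary document split into b = 8 / 4 = 2 chunks
print(isa_chunks("01011000", l=4))  # ['a0101', 'b1000']
```

Each original document thus contributes *b* short instances, and the number of distinct variates per instance is bounded by the chunk length rather than by the full vocabulary size.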

## *4.4.3 The* **README2** *Method*

Jerzak et al. (2022) proposed README2 aiming at improving the performance of the original README system by Hopkins and King (2010). This improved version attempts to counter three situations that could degrade the performance of the original method, and that the authors identified as (i) *semantic change*, concerning the differences in meaning of language used across *L* and *U*, which can in turn be *emergent* (some terminology appears exclusively in *U*) or *vanishing* (some terminology appears exclusively in *L*); (ii) *lack of textual discrimination*, regarding those categories that are hardly distinguishable by the textual features; and (iii) *proportion divergence*, which is analogous to the prior probability shift between *L* and *U*.

README2 introduces two main novelties with respect to the former version, seeking to better represent the meaning of text. The first one consists of moving away from the sparse representation of the feature space and the subsampling procedure in favour of a dense representation based on word embeddings. The second one consists of improving the feature discrimination by learning a (feed-forward) neural transformation of the resulting matrix which is optimised for quantification. This transformation is formalised as an optimisation problem seeking to satisfy two desirable criteria for the new representation: *category distinctiveness* (the new features bring about more distant class-conditional means across categories), and *feature distinctiveness* (the rows of the transformed matrix present low correlation with one another). The two criteria are implemented as two different loss functions which define, as a weighted sum, the objective loss function to minimise.

Additionally, the aforementioned "vanishing" discourse effect is mitigated by subsampling *L* in a way that its term distribution gets closer to that in *U*. Jerzak et al. (2022) observed that selective pruning of *L* indirectly helped to reduce the "proportion divergence" with respect to *U*.

<sup>14</sup> Actually of length (*l* + 1), since a positional character indicating the chunk's order in the original sequence is added.

# *4.4.4 The HDx Method*

González-Castro et al. (2013) propose a quantification method for binary problems based on distributional divergence as measured via the Hellinger Distance (HD). The method, referred to as HDx, applies to scenarios in which *p*(**x**|⊕) is assumed to be fixed but *p*(⊕) may vary.

This method is closely related to HDy and other mixture models (Section 4.2.8), with the difference that it considers probability distributions *f*(**x**) over the multidimensional input domain *X*, instead of distributions *f*(*s*(**x**)) over single-dimensional scores computed from the input. The rationale is to measure the similarity between the unlabelled distribution and a *validation distribution*, which is generated from the training distribution at a controlled prevalence. HDx iteratively varies this prevalence in small steps ranging from 0 to 1 and seeks the prevalence at which the resulting mixture best matches the unlabelled distribution.

HDx measures the divergence between two distributions *f* and *g* of the input data **x**, as represented in a feature space (e.g., tfidf values), via the HD, defined as

$$\text{HD}(f, g) = \sqrt{\int \left(\sqrt{f(\mathbf{x})} - \sqrt{g(\mathbf{x})}\right)^2 d\mathbf{x}} \tag{4.33}$$

The method they propose actually computes this divergence for each feature independently, after discretising the distribution of each feature using bins, and averages the resulting values across features. The integral is approximated by summing over the bins, i.e.,

$$\text{HD}(V, U) = \frac{1}{n} \sum\_{f=1}^{n} \sqrt{\sum\_{i=1}^{b} \left( \sqrt{\frac{|V\_{fi}|}{|V|}} - \sqrt{\frac{|U\_{fi}|}{|U|}} \right)^2} \tag{4.34}$$

where *V* is the validation sample, |*V<sub>fi</sub>*| is the number of items of *V* whose value for feature *f* falls in bin *i* (and analogously for |*U<sub>fi</sub>*|), *n* is the number of features (e.g., distinct terms), and *b* is the number of bins. The method thus consists of returning

$$\alpha^\* = \arg\min\_{\alpha \in [0,1]} \text{HD}(V^{\alpha}, U) \tag{4.35}$$

with *α*<sup>∗</sup> the estimated prevalence. Actually, *V<sup>α</sup>* is created neither by over- nor by undersampling *L*, but is instead constructed as a mixture of the class-conditional distributions parameterised with the desired prevalence *α*, i.e.,

$$V^{\alpha}(\mathbf{x}) = \alpha \cdot p(\mathbf{x}|\oplus) + (1 - \alpha) \cdot p(\mathbf{x}|\ominus) \tag{4.36}$$

Since the number of bins *b* might have a significant impact on the calculation, one typically returns the median of the distribution of the best *α*'s found for a range of *b*'s (typical values are *b* ∈ {10, 20, 30, ..., 110}).
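Equations 4.34 to 4.36, together with the median over bin counts, can be put together as in the following sketch. It is an illustration under assumptions, not the authors' code: the function name `hdx`, the equal-width binning over the pooled range of each feature, the grid of 101 candidate prevalence values, and the synthetic Gaussian data are all hypothetical choices.

```python
import numpy as np


def hdx(X_pos, X_neg, X_U, n_bins=10, n_steps=101):
    """Return the alpha minimising the feature-averaged Hellinger distance."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    total = np.zeros(n_steps)
    n_feats = X_U.shape[1]
    for f in range(n_feats):
        # Assumption: equal-width bins over the pooled range of the feature
        values = np.concatenate([X_pos[:, f], X_neg[:, f], X_U[:, f]])
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
        h_pos = np.histogram(X_pos[:, f], bins=edges)[0] / len(X_pos)
        h_neg = np.histogram(X_neg[:, f], bins=edges)[0] / len(X_neg)
        h_U = np.histogram(X_U[:, f], bins=edges)[0] / len(X_U)
        # Eq. 4.36: mixture of class-conditional histograms, one row per alpha
        mix = alphas[:, None] * h_pos + (1.0 - alphas[:, None]) * h_neg
        # Inner part of Eq. 4.34 for this feature, for all alphas at once
        total += np.sqrt(((np.sqrt(mix) - np.sqrt(h_U)) ** 2).sum(axis=1))
    return float(alphas[np.argmin(total / n_feats)])  # Eq. 4.35


# Synthetic data (hypothetical): three Gaussian features, true prevalence 0.3
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(1000, 3))
X_neg = rng.normal(-1.0, 1.0, size=(1000, 3))
X_U = np.vstack([rng.normal(1.0, 1.0, size=(600, 3)),
                 rng.normal(-1.0, 1.0, size=(1400, 3))])

# Median of the best alphas over b in {10, 20, ..., 110}
alpha_hat = float(np.median([hdx(X_pos, X_neg, X_U, n_bins=b)
                             for b in range(10, 111, 10)]))
print(alpha_hat)
```

Note that no classifier is trained at any point: the only quantities estimated from *L* are the per-feature class-conditional histograms.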

The same authors also propose HDy, previously discussed in Section 4.2.8, which, unlike HDx, measures the divergence in a single-dimensional space, which represents the codomain of a soft classifier *s*(**x**). The fact that HDy relies on a soft classifier to model *p*(⊕|**x**) precludes it from being considered a pure non-aggregative method. Note that HDy significantly outperforms HDx in the experimental evaluation conducted in González-Castro et al. (2013).

## *4.4.5 The MMD-RKHS Method*

Iyer et al. (2014) formulate the quantification problem in terms of minimising the Maximum Mean Discrepancy measure in a Reproducing Kernel Hilbert Space (MMD-RKHS). They prove some error bounds on the application of MMD to quantification and use such theoretical results to define a kernel learning method that minimises the MMD between the observed *p<sub>U</sub>*(**x**) and the mixture ∑<sub>*y*</sub> *p<sub>L</sub>*(**x**|*y*) *θ<sub>y</sub>*, under the assumption that *p<sub>U</sub>*(**x**|*y*) = *p<sub>L</sub>*(**x**|*y*), where the *θ<sub>y</sub>* are the unknown prevalence values to be estimated. They compare MMD-RKHS against the method of du Plessis and Sugiyama (2012), obtaining similar or slightly better results.
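The idea of matching kernel mean embeddings can be illustrated in the binary case, where the prevalence minimising the MMD has a closed form. The sketch below is a strong simplification and not the method of Iyer et al. (2014): it uses a fixed RBF kernel (no kernel learning), and the function names, the value of `gamma`, the symbol `theta`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


def mmd_prevalence(X_pos, X_neg, X_U, gamma=0.5):
    """Binary prevalence minimising the MMD between U and a mixture of classes.

    With mu_A the kernel mean embedding of sample A, we minimise
    || mu_U - theta * mu_pos - (1 - theta) * mu_neg ||^2 over theta,
    whose unconstrained minimiser is <mu_U - mu_neg, mu_pos - mu_neg>
    divided by || mu_pos - mu_neg ||^2.
    """
    k = lambda A, B: rbf_kernel(A, B, gamma).mean()  # inner product <mu_A, mu_B>
    k_pp, k_nn, k_pn = k(X_pos, X_pos), k(X_neg, X_neg), k(X_pos, X_neg)
    k_up, k_un = k(X_U, X_pos), k(X_U, X_neg)
    numerator = k_up - k_un - k_pn + k_nn
    denominator = k_pp - 2.0 * k_pn + k_nn
    return float(np.clip(numerator / denominator, 0.0, 1.0))


# Synthetic data (hypothetical): two Gaussian features, true prevalence 0.7
rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, 1.0, size=(300, 2))
X_neg = rng.normal(-1.0, 1.0, size=(300, 2))
X_U = np.vstack([rng.normal(1.0, 1.0, size=(420, 2)),
                 rng.normal(-1.0, 1.0, size=(180, 2))])
theta_hat = mmd_prevalence(X_pos, X_neg, X_U)
print(theta_hat)
```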

## *4.4.6 The Uncertainty-Aware Generative Model*

Keith and O'Connor (2018) propose a Generative Probabilistic Modelling (GPM) approach to prevalence estimation. The proposed method directly conducts inference for the unknown prevalence and caters for the inference of confidence intervals (CIs). CIs aim to capture the uncertainty of the model in providing an accurate class prevalence prediction (i.e., the more confident the model is about its prevalence estimation, the narrower the CI, and vice versa). This is the first quantification method in the literature that directly models uncertainty in terms of CIs.<sup>15</sup> CIs are discussed in more detail in Section 5.8.

<sup>15</sup> Note that, although CIs were already mentioned in the work of Hopkins and King (2010), their method (README; see Section 4.4.1) is not properly probabilistic, and CIs were obtained via bootstrap.

The idea explored in Keith and O'Connor (2018) is to learn a generative probabilistic model that, by assuming (i) that the documents are conditionally dependent on the label (i.e., the data generation process is of the form *Y* → *X*, see Section 1.5), and (ii) that the class-conditional (unigram) language models remain invariant between training and unlabelled distributions, proceeds by first sampling a prior class distribution *θ*, then sampling a label *y<sub>i</sub>* ∼ Bernoulli(*θ*) for each document, and finally sampling a bag-of-words document *x<sub>i</sub>* ∼ Multinomial(*φ<sub>y<sub>i</sub></sub>*) conditioned on the label. Different methods are explored as alternatives for the language model determining *φ<sub>y<sub>i</sub></sub>*. In particular, the authors consider two *explicit* ones (Multinomial Naive Bayes and Loglin) that directly model *p*(**x**|*y*), and an *implicit* one (LR-Implicit) that instead estimates said class-conditional *p*(**x**|*y*) via the posteriors *p*(*y*|**x**) generated by a discriminative classifier (a logistic regressor). The optimal prevalence (along with the CI) for a set of unlabelled items is then sought by simply exploring a grid of possible values and returning the one maximising the marginal log probability of all unlabelled documents.
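The grid search over prevalence values can be sketched for an explicit unigram language model as follows. This is a minimal illustration, not the authors' implementation: the function names, the symbol `theta` for the prevalence, the add-one smoothing, the grid resolution, the derivation of the interval from the normalised likelihood over the grid (i.e., assuming a flat prior), and the synthetic data are all assumptions.

```python
import numpy as np


def fit_unigram_lm(X, smoothing=1.0):
    """Log word probabilities of a class, from a documents-by-words count matrix."""
    counts = X.sum(axis=0) + smoothing
    return np.log(counts / counts.sum())


def grid_search_prevalence(X_U, log_phi_pos, log_phi_neg, n_grid=999, ci_alpha=0.1):
    """Return the prevalence maximising the marginal log-likelihood, plus an interval."""
    grid = np.linspace(0.001, 0.999, n_grid)
    # log p(x_i | class), up to the multinomial coefficient (constant in theta)
    ll_pos = X_U @ log_phi_pos
    ll_neg = X_U @ log_phi_neg
    # Marginal log-likelihood: sum_i log(theta p(x_i|+) + (1 - theta) p(x_i|-))
    mll = np.array([
        np.logaddexp(np.log(t) + ll_pos, np.log1p(-t) + ll_neg).sum()
        for t in grid
    ])
    # Assumption: flat prior, so the normalised likelihood acts as a posterior
    post = np.exp(mll - mll.max())
    cdf = np.cumsum(post / post.sum())
    lo = grid[np.searchsorted(cdf, ci_alpha / 2)]
    hi = grid[min(np.searchsorted(cdf, 1 - ci_alpha / 2), n_grid - 1)]
    return float(grid[np.argmax(mll)]), (float(lo), float(hi))


# Synthetic corpus (hypothetical): 20-word vocabulary, documents of 30 tokens
rng = np.random.default_rng(0)
phi_pos = np.linspace(1.0, 5.0, 20)
phi_pos /= phi_pos.sum()
phi_neg = phi_pos[::-1].copy()
draw = lambda phi, n: rng.multinomial(30, phi, size=n)

log_phi_pos = fit_unigram_lm(draw(phi_pos, 500))
log_phi_neg = fit_unigram_lm(draw(phi_neg, 500))
X_U = np.vstack([draw(phi_pos, 250), draw(phi_neg, 750)])  # true prevalence 0.25

theta_hat, (lo, hi) = grid_search_prevalence(X_U, log_phi_pos, log_phi_neg)
print(theta_hat, (lo, hi))
```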

Among the variants explored, LR-Implicit yields the best results in terms of MAE (for natural and artificial training prevalence values) and CI coverage (the proportion of times the CI<sub>*α*=0.1</sub> happens to contain the true class prevalence).

Technically, a generative model equipped with a discriminative classifier as a proxy for computing *p*(**x**|*y*) from the posteriors *p*(*y*|**x**) might better fit within the family of aggregative methods (discussed in Section 4.2); indeed, the authors discuss the close connections between LR-Implicit and the aggregative SLD method of Saerens et al. (2002) (see Section 4.2.9). However, the fact that the general generative framework described in Keith and O'Connor (2018) only requires the specification of a language model conditioned on the class labels (as directly attained by the *explicit* variants) squarely places the approach within the non-aggregative methods.

# *4.4.7 Deep Quantification Network*

Qi et al. (2020) propose a Deep Quantification Network (DQN) that makes quantification predictions by combining quantification predictions made on samples from the test set to be quantified. More specifically, the training set *L* of labelled objects (with binary or multi-class labels) is split, by sampling without replacement, into |*L*|/*m* *m*-tuples of *m* objects each (e.g., 100 objects). The training examples for DQN are thus the *m*-tuples with their prevalence values, as determined by the labels assigned to the elements in the tuple. The sampling policy that generates the *m*-tuples is a parameter of the method. The authors tested two sampling methods. The first is random sampling, which may yield low variance in the prevalence values of the *m*-tuples, thus risking overfitting DQN to the prevalence of the training set. The second is based on the Zipf distribution and produces samples that exhibit more varied prevalence values, with the aim of countering overfitting to the prevalence of the training set.

A set of *m*-tuples generated using the whole training set defines an *epoch* of the training process. Many sets of *m*-tuples, and thus many training epochs, are used to train the DQN.
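The generation of one epoch of training examples under the random sampling policy can be sketched as follows. This is an illustration only: the function name `make_m_tuples` and the synthetic labels are hypothetical, and the Zipf-based policy is not reproduced here since its details are not given above.

```python
import numpy as np


def make_m_tuples(y, m, n_classes, rng):
    """Split the training indices into |L| // m tuples, sampled without replacement.

    Returns the index matrix (one row per m-tuple) and the class prevalence
    values of each tuple, which serve as the training targets.
    """
    n_tuples = len(y) // m
    shuffled = rng.permutation(len(y))[: n_tuples * m]
    idx = shuffled.reshape(n_tuples, m)
    prevalences = np.stack([
        np.bincount(y[row], minlength=n_classes) / m for row in idx
    ])
    return idx, prevalences


# Hypothetical binary training set of 1000 items, split into tuples of m = 100
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
idx, prevalences = make_m_tuples(y, m=100, n_classes=2, rng=rng)
print(idx.shape, prevalences.shape)  # (10, 100) (10, 2)
print(prevalences[:, 1].std())       # spread of the positive prevalence across tuples
```

The printed standard deviation gives a feel for why purely random sampling tends to produce tuples whose prevalence values all lie close to the training prevalence.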

The DQN is composed of three main components, chained one after the other:


DQN is thus a feature extraction network followed by one (if CON, AVG, MIN, MAX components are used) or two (if NN is used) dense layers that convert the feature vector into prevalence estimations.

At test time the test set is split into *m*-tuples in the same way as the training data, and prevalence predictions are collected for every *m*-tuple and averaged. Similarly to the training process, more than one split can be generated for a test set. In this case the various prevalence predictions from the different splits are averaged to produce the final prediction for the test set.

Qi et al. (2020) tested their method on binary (IMDb) and multi-class text datasets with up to 20 labels (20 Newsgroups (Lang, 1995)), comparing it against CC, PCC, ACC, PACC, and ReadMe (King and Lu, 2008). The configuration using Zipf-based sampling and NN as the *m*-tuple feature extraction component always performed better than any other configuration, and better by 45% on average, measured in terms of MAE reduction, than any of the compared baselines.

A key difference between DQN and QuaNet (Section 4.2.12) is that DQN directly tackles the quantification problem without relying on a classification method. In QuaNet, the document embeddings and classification scores are based on solving a classification problem. In DQN, all vector representations, including the feature vector representation for a single item in an *m*-tuple, are learned during the end-to-end learning<sup>16</sup> process that aims at quantification.

<sup>16</sup> In deep learning, the expression "end-to-end learning" indicates that all the parameters of a possibly complex and deep network are fitted at the same time during a single training phase, considering the whole network as a single model. This is in contrast to other training approaches in which some neural models are regarded as pre-trained models, and that typically consist of either training (only) a set of additional layers, or modules, stacked on top of the pre-trained model, or performing fine-tuning of the pre-trained model using the dataset at hand.

It is worth noting that setting the *m*-tuple size to the extreme, and rather odd, value of one transforms DQN into an aggregative method based on classification. When the *m*-tuple size is one, any *m*-tuple can only have a prevalence of either one or zero, i.e., coinciding with the classification label of the single item it contains. The DQN thus classifies every single item and then outputs its prevalence estimate by looking at the set of classification scores. For any *m*-tuple size larger than one, DQN can be considered a non-aggregative method.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Advanced Topics**

## **5.1 Ordinal Quantification**

A special case of single-label classification is the ordinal one, in which the *m* > 2 classes are arranged in a total order. In this case, classes define a discrete, typically non-metric, qualitative scale. An example of this is the star rating model of product reviews, which is a typical problem faced in sentiment analysis. The sentiment scenario is one that highlights how quantification fits well with ordinal problems, as the typical use of ordinal ratings is to observe how the aggregated evaluations distribute among the various grades.

It is straightforward to observe that any quantification method for the SLQ case (see Section 4) can be applied to the ordinal case, and also that this approach is likely suboptimal as it does not take advantage of the total order among classes. Esuli and Sebastiani (2010b) discussed the scenario of ordinal quantification, and proposed an evaluation measure for it (see Section 3.2). The 2016 SemEval challenge proposed an ordinal quantification task (Nakov et al., 2016) that collected ten submissions from participants. Among them, only two submissions were based on methods specifically designed for the ordinal quantification task.

The method proposed by Da San Martino et al. (2016a,b), winner of the challenge, builds a binary tree from a set of binary classifiers trained on (*m* − 1) split points of the ordinal scale. For example, when *m* = 5, four binary classifiers are trained: one that separates the elements in {*y*<sub>1</sub>} from the elements in {*y*<sub>2</sub>, *y*<sub>3</sub>, *y*<sub>4</sub>, *y*<sub>5</sub>}, and three others for the {*y*<sub>1</sub>, *y*<sub>2</sub>} vs {*y*<sub>3</sub>, *y*<sub>4</sub>, *y*<sub>5</sub>}, {*y*<sub>1</sub>, *y*<sub>2</sub>, *y*<sub>3</sub>} vs {*y*<sub>4</sub>, *y*<sub>5</sub>}, and {*y*<sub>1</sub>, *y*<sub>2</sub>, *y*<sub>3</sub>, *y*<sub>4</sub>} vs {*y*<sub>5</sub>} splits. All the binary classifiers are corrected for quantification by applying PCC (see Section 4.2.2). The root node of the tree structure is determined by selecting the binary classifier that has the smallest quantification error, measured via KLD. Subsequent nodes of the tree are determined recursively on the subsets of classifiers selected by the split of the parent node, until a split selects a single classifier. Quantification is performed by accumulating posterior probabilities for each element in the set of unlabelled items with respect to each category. The posterior probability for an element with respect to a category is defined by the product of the probabilities in the path of the binary tree that goes from the root to the leaf associated with that category.
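The path-product rule for posterior probabilities can be sketched as follows. The nested-tuple encoding of the tree, the classifier names `c1` and `c2`, and the probability values are hypothetical; the sketch only illustrates how the outputs of the binary classifiers are combined for one item.

```python
def leaf_posteriors(node, probs, path_prob=1.0, out=None):
    """Posterior of each category as the product of probabilities along its path.

    A leaf is a category name; an internal node is a triple
    (classifier_name, left_subtree, right_subtree), where probs[classifier_name]
    is the probability that the item belongs to the left group of the split.
    """
    out = {} if out is None else out
    if isinstance(node, str):
        out[node] = path_prob
        return out
    name, left, right = node
    p_left = probs[name]
    leaf_posteriors(left, probs, path_prob * p_left, out)
    leaf_posteriors(right, probs, path_prob * (1.0 - p_left), out)
    return out


# Hypothetical tree for m = 3: c1 separates {y1} from {y2, y3}, c2 separates y2 from y3
tree = ("c1", "y1", ("c2", "y2", "y3"))
posteriors = leaf_posteriors(tree, probs={"c1": 0.7, "c2": 0.4})
print(posteriors)  # approximately {'y1': 0.7, 'y2': 0.12, 'y3': 0.18}
```

Averaging these per-item posteriors over all unlabelled items then yields the class prevalence estimates.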

Esuli (2016) proposed a similar approach, in which a binary tree of classifiers is built on split points of the ordinal scale. The difference with the previous approach lies in the criterion used to define the tree, which in this work is based on selecting for the root (and then recursively for any other subtree) the split point that produces the most balanced training set, adopting the heuristic that quantification methods may perform better on balanced datasets than on unbalanced ones. For example, for an ordinal scale that has labels {*y*<sub>1</sub>, *y*<sub>2</sub>, *y*<sub>3</sub>, *y*<sub>4</sub>}, with 40, 20, 10 and 10 training examples respectively, the best split for the root of the tree is {*y*<sub>1</sub>} vs {*y*<sub>2</sub>, *y*<sub>3</sub>, *y*<sub>4</sub>}, as it produces a 50-50% split of the examples. The method of Da San Martino et al. (2016a), which is based on the actual evaluation of the quantification accuracy to define the binary tree, experimentally proved superior to the one of Esuli (2016).
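The balance-based choice of split points can be sketched as follows. The function names and the nested-tuple output are hypothetical, and ties are broken here in favour of the leftmost split point, which is an assumption of the sketch.

```python
def most_balanced_split(counts):
    """Index s such that labels[:s] vs labels[s:] is the most balanced split."""
    total = sum(counts)
    return min(range(1, len(counts)),
               key=lambda s: abs(2 * sum(counts[:s]) - total))


def build_tree(labels, counts):
    """Recursively build the binary tree of split points as nested tuples."""
    if len(labels) == 1:
        return labels[0]
    s = most_balanced_split(counts)
    return (build_tree(labels[:s], counts[:s]),
            build_tree(labels[s:], counts[s:]))


labels = ["y1", "y2", "y3", "y4"]
counts = [40, 20, 10, 10]
print(most_balanced_split(counts))  # 1, i.e., {y1} vs {y2, y3, y4}
print(build_tree(labels, counts))   # ('y1', ('y2', ('y3', 'y4')))
```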

## **5.2 Regression Quantification**

Aggregative approaches can provide useful results also in applications where regression (not classification) is the task at hand for single data points. In a foundational work with little follow-up (Bella et al., 2014), the problem of quantification for regression is outlined, aimed at estimating composite quantities such as sales, quantities of consumed goods, or overall duration.

The authors provide a supporting sample application: "Consider a maternity ward that has collected data about baby weight at birth (dependent variable) for risk pregnancies, jointly with several features about the mother and her current and previous pregnancies (input variables). With these (training) data, a regression model has been trained in order to predict baby weight. In order to better plan the resources needed and the number of expected complications, the hospital wants to estimate the distribution of weight births for the following month, according to a new group of pregnant women (unlabelled data) that the maternity ward is monitoring for future deliveries."

Let *y* denote the dependent variable, as customary in regression settings. The key aggregated value to quantify is taken to be the average of the dependent variable over a sample *U* of unlabelled items. A first trivial solution is proposed by computing

$$
\hat{\mu}\_U = \mu\_L \tag{5.1}
$$

i.e., the regression counterpart of Maximum Likelihood Prevalence Estimation (Section 4.1), dubbed *Test to Train* (TT). As usual, *L* represents the labelled (training) set, and *U* the unlabelled (test) set.

Another solution which neglects dataset shift (Section 1.5) and performs simple aggregation of individual estimates, is dubbed *Regress and Sum* (RSu), and corresponds to computing

$$
\hat{\mu}\_U = \frac{\sum\_{i=1}^{|U|} \hat{y}\_i}{|U|} \tag{5.2}
$$

where *ŷ* represents the estimate provided by a regression model trained on *L*. This estimate is clearly reminiscent of Classify and Count. RSu estimates are likely to suffer from potential weaknesses of the underlying regression model, typically trained via minimisation of mean square error on *L*. The authors argue that quadratic loss functions discourage predictions far from the mean, thus bringing about more packed predictions.

This may be acceptable if we are only interested in a single value or indicator such as the mean *μ̂<sub>U</sub>*, but becomes more of an issue if we are interested in estimating a full probability distribution for the output value *y*, a different and fully legitimate task in the realm of quantification for regression. To exemplify, the counterpart of RSu for this task can be computed as

$$\hat{P}\_U(y \le r) = \frac{\sum\_{i=1}^{|U|} \mathbb{1}(\hat{y}\_i \le r)}{|U|} \tag{5.3}$$

where 𝟙(·) is the indicator function. This method is dubbed *Regress and Splice* (RSp).
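Equations 5.2 and 5.3 amount to a mean and an empirical cumulative distribution of the individual predictions. A minimal sketch, with hypothetical function names and hypothetical predicted birth weights (in kg), is:

```python
import numpy as np


def regress_and_sum(y_hat_U):
    """RSu (Eq. 5.2): average of the individual predictions on U."""
    return float(np.mean(y_hat_U))


def regress_and_splice(y_hat_U, r):
    """RSp (Eq. 5.3): fraction of the predictions on U not exceeding r."""
    return float(np.mean(np.asarray(y_hat_U) <= r))


# Hypothetical predictions issued by a regressor on four unlabelled items
y_hat_U = [2.9, 3.1, 3.4, 3.6]
mu_hat = regress_and_sum(y_hat_U)          # about 3.25
p_hat = regress_and_splice(y_hat_U, 3.2)   # 0.5
print(mu_hat, p_hat)
```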

A further drawback of RSu is the inheritance of bias from its underlying regression model, which can be non-zero even in the absence of dataset shift. The authors propose three heuristics designed to reduce the impact of the above-mentioned issues, thus improving aggregate quantification:

• *Adjustment* is aimed at compensating for bias, as estimated on *L*. This leads to a method dubbed *Adjusted Regress and Sum* (ARS), summarised by the formula

$$
\hat{\mu}\_{U}^{ARS} = \hat{\mu}\_{U}^{RSu} + \alpha B\_{L}^{RSu} \tag{5.4}
$$

Here *B*<sub>*L*</sub><sup>RSu</sup> is the bias of the RSu estimate computed on *L*, and *α* represents a modulating factor optimised empirically.

• *Segmentation* responds to the need for different adjustments across regions of the input space. In other words, it is reasonable to expect that the bias of a regression model will be region-dependent, bringing about systematic underestimation in some areas, while overestimating elsewhere. A number of thresholds are suitably defined for *y*, based on values taken by *y* in the training set *L*. Predictions *ŷ* issued on *U* are binned according to these thresholds, approximated with a value deemed representative of the respective bin, and adjusted in a bin-dependent way. More in detail, the computation is the result of the following steps:


Individual predictions are thus corrected according to formula

$$
\hat{y} = \hat{y}\_{j}^{m} + \alpha B\_{j} \tag{5.5}
$$

where bin membership is denoted by subscript *j*. Finally, *μ̂<sub>U</sub>* is computed as the average of predictions over *U*.

• *Spreading* is aimed at counteracting the compression of predictions *ŷ*, brought about by regression models which have a tendency to produce packed outputs. For this reason, estimates *ŷ* are corrected via the Nadaraya-Watson kernel as a first step. This kernel smoothing algorithm makes it possible to artificially increase the variance of predicted values to better match the variance of the real values *y* when required. Spreading can be used in conjunction with all techniques described above, including TT, RSu and ARS. It is deemed especially useful when the task at hand requires an estimate of the whole probability density, less so when the interest lies in the average value *μ̂<sub>U</sub>*.
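Of the three heuristics, the adjustment of Equation 5.4 is the simplest to illustrate. The sketch below is hedged: the function name is hypothetical, and it assumes that the bias is measured as the true mean on *L* minus the RSu estimate on *L*, so that adding it compensates for systematic under- or overestimation; the sign convention adopted by the authors is not stated above.

```python
import numpy as np


def adjusted_regress_and_sum(y_L, y_hat_L, y_hat_U, alpha=1.0):
    """ARS (Eq. 5.4): RSu estimate on U plus alpha times the bias measured on L."""
    # Assumed sign convention: bias = true mean on L minus RSu estimate on L
    bias_L = np.mean(y_L) - np.mean(y_hat_L)
    return float(np.mean(y_hat_U) + alpha * bias_L)


# Hypothetical values: the regressor underestimates the mean on L by 0.1
y_L = [3.0, 3.5, 4.0]
y_hat_L = [3.2, 3.4, 3.6]
y_hat_U = [3.0, 3.2]
mu_ars = adjusted_regress_and_sum(y_L, y_hat_L, y_hat_U, alpha=1.0)
print(mu_ars)  # about 3.2, i.e., the RSu estimate 3.1 shifted by the bias 0.1
```

With `alpha=0.0` the estimate falls back to plain RSu, which is why *α* acts as a modulating factor.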

## **5.3 Cross-Lingual Quantification**

*Cross-Lingual Quantification* (CLQ) is the task of performing quantification in scenarios in which training documents in the target language for which quantification needs to be performed do not exist (or are too few to train a reliable quantifier) but exist for a different source language. Additionally, large quantities of unlabelled documents are assumed to be easily accessible for both languages. Esuli et al. (2020) formally defined the task and proposed preliminary baselines for binary sentiment classification. The key observation is that, when performed via aggregative methods, cross-lingual quantification could be directly enabled via the combination of cross-lingual classification and quantification correction. In Esuli et al. (2020), *Cross-lingual Structural Correspondence Learning* (Prettenhofer and Stein, 2011) and *Distributional Correspondence Indexing* (Moreo et al., 2016), two methods capable of generating cross-lingual vectorial representations (i.e., in a language-agnostic vector space), were used to train (general purpose) classifiers and tested in combination with CC, PCC, ACC, PACC, and QuaNet (discussed in Section 4.2).

Note that CLQ is an instance of *transfer learning* (Pan and Yang, 2010), the general learning framework dealing with differences in data distribution and data representation between the source and the target domains. Other variants of transfer learning (e.g., cross-domain text quantification) remain, to the best of our knowledge, unexplored. We are likewise unaware of more general CLQ methods tackling quantification by topic (instead of by sentiment), dealing with multi-class problems (instead of binary), or adopting non-aggregative approaches (that is, without relying on cross-lingual classification as an intermediate step).

## **5.4 Quantification for Networked Data**

Networked data quantification is a special quantification setting where a network structure connects the individual unlabelled items, as is the case, e.g., with hyperlinked web pages. In classification, the presence of hyperlinks allows the use of supervised ("relational") learning techniques that leverage both endogenous features (e.g., textual content) and exogenous features (e.g., hyperlinks and/or the labels of neighbouring items) (Chakrabarti et al., 1998; Macskassy and Provost, 2007). The term "collective classification" (see also Section 6.4) is often used to denote the fact that the classification of networked items is best tackled collectively, and not for each item in isolation from the others, since the label to be assigned to one item may influence the label to be assigned to another item. This is consistent with homophily effects and preferential attachment often seen in networked data.

So, one obvious method of performing relational quantification is using a state-of-the-art collective classification algorithm and correcting the resulting prevalence estimates via method ACC (or Method Max, Method X, T50, MS, MM). Tang et al. (2010) follow this route by using the wvRN algorithm of Macskassy and Provost (2003) as the collective classification algorithm. However, they further propose a non-aggregative method called *Link-Based Quantification* (LBQ), inspired by the ACC method of Section 4.2.3. Let *p(i<sup>k</sup>)* denote the fraction of nodes in the network that link to node *i* with *(k*−1*)* levels of indirection (so that, e.g., *p(i<sup>1</sup>)* is the fraction of nodes in the network that directly link to node *i*). From the law of total probability it follows that

$$p(\tilde{i}^k) = p(\tilde{i}^k|\oplus) \cdot p\_U(\oplus) + p(\tilde{i}^k|\ominus) \cdot (1 - p\_U(\oplus))\tag{5.6}$$

entailing

$$p\_U(\oplus) = \frac{p(\tilde{i}^k) - p(\tilde{i}^k|\ominus)}{p(\tilde{i}^k|\oplus) - p(\tilde{i}^k|\ominus)}\tag{5.7}$$

Equation 5.7 allows estimating *pU (*⊕*)*, since the value of *p(i<sup>k</sup>)* can be observed directly in the network, while the values of *p(i<sup>k</sup>*|⊕*)* and *p(i<sup>k</sup>*|⊖*)* can be estimated from a training set. A different estimate *p̂U<sup>(i,k)</sup>(*⊕*)* of *pU (*⊕*)* can be obtained for each pair *(i, k)* composed of a node *i* in the network and an integer value of *k*. In order to obtain a robust estimate, the authors compute all estimates for *k* ∈ [1*, kmax*] (for a given *kmax*), and use their median as the final estimate *p̂U (*⊕*)*.

Quantification based on homophily is further explored in Milli et al. (2015). A community detection algorithm is run on the whole network graph (comprising elements from *U* and *L*). Each node in *U* is subsequently assigned the most frequent label among the nodes in its community that belong to *L*. In case of community overlap, a prevailing community is identified based on its density or on the highest class prevalence within the community. Alternatively, ego-networks are proposed as a way to define the community of a given node. Given a node's neighbourhood (the nodes directly or *k*-hop-connected to it), its missing label is determined as the majority label in the neighbourhood.

After label assignment is carried out, Classify and Count and Adjusted Classify and Count are employed as strategies to aggregate the results. For the latter, false positive rates and true positive rates are estimated on *L* with a leave-one-out approach.
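As a small illustration of the LBQ estimate discussed earlier in this section, the following sketch applies Equation 5.7 to each *(i, k)* pair and takes the median of the results. The function names, the clipping to [0, 1], and the numeric values are our own assumptions for illustration; they are not taken from Tang et al. (2010).

```python
from statistics import median


def lbq_pair_estimate(p_link, p_link_given_pos, p_link_given_neg):
    """Equation 5.7 for a single (i, k) pair; returns None when undefined."""
    denominator = p_link_given_pos - p_link_given_neg
    if denominator == 0:
        return None  # linking to node i carries no information about the class
    estimate = (p_link - p_link_given_neg) / denominator
    return min(1.0, max(0.0, estimate))  # clipping is an added safeguard (assumption)


def lbq(pairs):
    """Median of the per-pair estimates, used as the final prevalence estimate."""
    estimates = [lbq_pair_estimate(*pair) for pair in pairs]
    return median(e for e in estimates if e is not None)


# Hypothetical triples (p(i^k), p(i^k | positive), p(i^k | negative)).
pairs = [
    (0.26, 0.50, 0.10),
    (0.15, 0.30, 0.05),
    (0.32, 0.60, 0.20),
]
prevalence = lbq(pairs)
print(round(prevalence, 3))
```

Taking the median rather than the mean limits the influence of pairs for which the denominator of Equation 5.7 is close to zero, which would otherwise produce extreme estimates.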

## **5.5 Cost Quantification**

A specific flavour of quantification has been tackled by Forman (2006, 2008) and dubbed *cost quantification*. For this application, each data point comes with additional cost information associated with it. A key application is represented by a business looking for insight into warranty costs for its products. Given a set of customer support logs, comprising textual data about issues described by customers and the cost of support (e.g., repairs), we are interested in quantifying how much each type of issue contributes to after-sales expenses. Classes are represented by different issues or any atomic feature that might drive quality assurance decisions for the business, e.g., CrackedScreen or SwollenBattery. This task is trivially solved by a quantifier if the average cost for a given issue is fixed and known in advance. However, a further source of complexity is often introduced by the variability of prices for components.

*Classify and Total* (CT) is the simplest algorithm considered. Being the counterpart of Classify and Count, it is based on running a classifier on each item in *U* and adding up the cost *c(***x***)* associated with each item labelled as belonging to the class of interest, which comes down to computing

$$S\_{\mathbf{y}} = \sum\_{\mathbf{x} \in U : h(\mathbf{x}) = \mathbf{y}} c(\mathbf{x}) \tag{5.8}$$

This approach has similar limitations to Classify and Count.

*Grossed-Up Total* (GUT) mitigates this problem by pushing the CT estimate *Sy* upwards or downwards according to the ratio between the class prevalence estimate by a proper quantifier *Mq* and the one provided by the classifier employed, i.e.,

$$S\_{\mathbf{y}}^{\prime} = S\_{\mathbf{y}} \times \frac{\hat{p}\_{U}^{M\_{q}}(\mathbf{y})}{\frac{1}{|U|} \sum\_{\mathbf{x} \in U} \mathbb{1} \left(h(\mathbf{x}) = \mathbf{y}\right)}\tag{5.9}$$

which can be rewritten as

$$S\_\mathbf{y}' = \hat{p}\_U^{M\_\mathbf{q}}(\mathbf{y})|U| \times \frac{S\_\mathbf{y}}{\sum\_{\mathbf{x} \in U} \mathbb{1}(h(\mathbf{x}) = \mathbf{y})} \tag{5.10}$$

thus making two factors explicit. The first represents an estimate of cardinality for class *y* within *U* given by quantifier *Mq* , while the second one can be interpreted as a best guess of average cost for class *y* provided by classifier *h(***x***)*, which, however, is quite likely to be polluted by misclassified items.
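A minimal sketch of CT (Equation 5.8) and GUT (in the form of Equation 5.10) may help to fix ideas. The function names, the toy costs, and the prevalence value attributed to the quantifier are hypothetical and serve only as an illustration.

```python
def classify_and_total(costs, predictions, target):
    """Equation 5.8: sum of the costs of the items predicted as `target`."""
    return sum(c for c, y in zip(costs, predictions) if y == target)


def grossed_up_total(costs, predictions, target, quantifier_prevalence):
    """Equation 5.10: estimated class cardinality times average predicted cost."""
    n_predicted = sum(1 for y in predictions if y == target)
    if n_predicted == 0:
        raise ValueError("no item was assigned to the target class")
    average_cost = classify_and_total(costs, predictions, target) / n_predicted
    estimated_cardinality = quantifier_prevalence * len(costs)
    return estimated_cardinality * average_cost


# Toy data: eight support cases, two of which the classifier assigns to the class.
costs = [100.0, 80.0, 120.0, 50.0, 60.0, 40.0, 30.0, 20.0]
predictions = ["CrackedScreen"] * 2 + ["Other"] * 6

ct = classify_and_total(costs, predictions, "CrackedScreen")
# Assume a quantifier estimates the prevalence of the class to be 0.5.
gut = grossed_up_total(costs, predictions, "CrackedScreen", 0.5)
print(ct, gut)
```

In this toy case the classifier finds only two items of the class while the quantifier estimates four, so GUT doubles the CT total; the average cost of 90 is still computed only on the items the classifier selected.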

*Conservative Average* \* *Quantifier* (CAQ) is aimed at reducing pollution by computing a cost average on a predefined amount of items from *U*, which we deem very likely to belong to class *y*. These items are taken in decreasing order of posterior probability *p(y*|**x***)*.

*Precision Corrected Average* \* *Quantifier* (PCAQ) takes the above idea a step further by estimating the precision (or Positive Predictive Value, PPV) of classifier *h(***x***)* on the unlabelled set *U*. For ease of notation, in the binary case, let us shorten the symbol for the estimate of the prevalence of class ⊕ within *U* provided by quantifier *Mq* to *q* = *p̂U<sup>Mq</sup>(*⊕*)*. Moreover, let PPV<sub>*h*</sub> denote the precision of classifier *h(***x***)* on *U*. The value of PPV<sub>*h*</sub> on *U* can be computed from the estimate of class prevalence *q* and from estimates of the true and false positive rates of *h(***x***)* (TPR<sub>*h*</sub>, FPR<sub>*h*</sub>), obtained via cross-validation on *L*, i.e.,

$$\text{PPV}\_h = \frac{q \cdot \text{TPR}\_h}{q \cdot \text{TPR}\_h + (1 - q) \cdot \text{FPR}\_h} \tag{5.11}$$

This value is then employed to compute the average cost of positive predicted instances via

$$C\_{\oplus}^{h} = \text{PPV}\_{h} C\_{\oplus} + (1 - \text{PPV}\_{h}) C\_{\ominus} \tag{5.12}$$

where *C*<sub>⊕</sub> is the average cost of items in class ⊕, which we need to estimate, and *C*<sub>⊖</sub> is the average cost of items in class ⊖. A further equation linking these quantities can be specified on the whole set *U* of unlabelled items, i.e.,

$$C\_U = p\_U(\oplus)C\_\oplus + (1 - p\_U(\oplus))C\_\ominus \tag{5.13}$$

where *CU* is the average cost of items in *U*. After solving Equation 5.13 for *C*<sub>⊖</sub>, plugging the result into Equation 5.12, and substituting *pU (*⊕*)* with its estimate *q*, we obtain

$$C\_{\oplus} = \frac{(1 - q)C\_{\oplus}^{h} - (1 - \text{PPV}\_{h})C\_{U}}{\text{PPV}\_{h} - q} \tag{5.14}$$

which is then multiplied by the estimated class cardinality *q* · |*U*| to get the final cost quantification. Note that the estimates of both the classifier precision PPV<sub>*h*</sub> and the average cost *C*<sub>⊕</sub><sup>*h*</sup> depend on how the classifier's threshold is selected.
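The PCAQ computation can be sketched as follows, chaining Equation 5.11, Equation 5.14, and the final multiplication by the estimated class cardinality. The function and argument names are hypothetical, and the numbers are constructed so that the true average costs (100 for ⊕ and 10 for ⊖) are known in advance.

```python
def ppv_from_rates(q, tpr, fpr):
    """Equation 5.11: precision implied by prevalence q and the TPR/FPR estimates."""
    return q * tpr / (q * tpr + (1 - q) * fpr)


def pcaq(q, tpr, fpr, avg_cost_predicted_pos, avg_cost_all, n_items):
    """Equation 5.14 followed by multiplication by the cardinality q * |U|."""
    ppv = ppv_from_rates(q, tpr, fpr)
    if abs(ppv - q) < 1e-12:
        raise ValueError("PPV equals q: the classifier is uninformative")
    avg_cost_pos = (
        (1 - q) * avg_cost_predicted_pos - (1 - ppv) * avg_cost_all
    ) / (ppv - q)
    return avg_cost_pos * q * n_items


# Synthetic setting: q = 0.2, TPR = 0.8, FPR = 0.1 give PPV = 2/3.
# With true average costs 100 (positives) and 10 (negatives):
#   average cost of predicted positives = 2/3 * 100 + 1/3 * 10 = 70
#   average cost over the whole of U    = 0.2 * 100 + 0.8 * 10 = 28
total_cost = pcaq(q=0.2, tpr=0.8, fpr=0.1,
                  avg_cost_predicted_pos=70.0, avg_cost_all=28.0,
                  n_items=1000)
print(round(total_cost, 2))
```

The sketch recovers an average cost of 100 for the positive class, and hence a total of 20000 over the estimated 200 positive items, even though one third of the predicted positives are misclassified.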

*Median Sweep of PCAQ* applies the philosophy of Median Sweep from Section 4.2.6 to PCAQ by considering several values for classifier threshold, getting a different estimate *Cy* for each of them via PCAQ, and regarding their median as a final estimate.

*Mixture Model Average* \* *Quantifier* applies a similar idea directly to Equation 5.12. By letting threshold *t* vary we obtain

$$\frac{C\_{\oplus}^{t}}{\text{PPV}^{t}} = C\_{\oplus} + C\_{\ominus} \frac{1 - \text{PPV}^{t}}{\text{PPV}^{t}} \tag{5.15}$$

i.e., a system of equations, one for each threshold value, which can be solved for *C*<sub>⊕</sub> and *C*<sub>⊖</sub> via linear regression.
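To illustrate the regression step, note that dividing Equation 5.12 by the precision gives a straight line whose intercept is the average cost of the positive class and whose slope is the average cost of the negative class. The sketch below fits that line by ordinary least squares; the function name and the noise-free toy values are assumptions made for illustration.

```python
def fit_class_costs(ppvs, avg_costs_predicted_pos):
    """Least-squares fit of y = C_pos + C_neg * x, one (x, y) point per threshold,
    with x = (1 - PPV) / PPV and y = (average cost of predicted positives) / PPV."""
    xs = [(1 - p) / p for p in ppvs]
    ys = [c / p for p, c in zip(ppvs, avg_costs_predicted_pos)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # estimate of C_neg
    intercept = mean_y - slope * mean_x   # estimate of C_pos
    return intercept, slope


# Four thresholds with precisions 0.5, 0.6, 0.8, 0.9; the average costs of the
# predicted positives are generated from C_pos = 100 and C_neg = 10.
ppvs = [0.5, 0.6, 0.8, 0.9]
observed = [p * 100.0 + (1 - p) * 10.0 for p in ppvs]
c_pos, c_neg = fit_class_costs(ppvs, observed)
print(round(c_pos, 2), round(c_neg, 2))
```

With real data the points would not lie exactly on a line, since both the precision and the average cost at each threshold are themselves estimates.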

Note that these methods approximate the values of TPR and FPR on the unlabelled set *U* with estimates computed via cross-validation on *L*, which may be a bad approximation unless *pL(***x**|*y)* = *pU (***x**|*y)*, i.e., unless *L* and *U* are connected by prior probability shift.

## **5.6 Quantification in Data Streams**

Yang and Zhou (2008) consider the problem of estimating the shift in prior distribution while observing a sequence of objects from a stream. Their aim is to improve classification accuracy by using shift-updated priors in a classification model that is trained only once at the beginning of the process, i.e., without resorting to active learning and retraining. The proposed method adapts the EM method of Saerens et al. (2002) from a batch setup, i.e., estimating new priors for a set of unlabelled objects, to an online setup, i.e., correcting priors every time a new object appears in the stream. Differently from the method by Saerens et al. (2002), the Online EM (OEM) method of Yang and Zhou (2008) applies the E and M steps only once to each element that is sequentially generated by the stream. The initial priors, as well as the likelihood function, are computed on a training set. The E step computes the posterior probabilities of the *k*-th element of the sequence **x**<sub>1</sub> *...* **x**<sub>*n*</sub> of elements of the set *U* of unlabelled items using the likelihood function and the priors for the *k*-th step, similarly to the method by Saerens et al. (2002). The M step computes the corrected priors for the *(k* + 1*)*-th element of the sequence using an exponential forget function that combines the priors of the *k*-th step with the posteriors of the *k*-th element, i.e.,

$$
\hat{p}\_{k+1}(\mathbf{y}) = \alpha \hat{p}\_k(\mathbf{y}|\mathbf{x}\_k) + (1 - \alpha)\hat{p}\_k(\mathbf{y}) \tag{5.16}
$$

The OEM method is thus an online quantification method in the strict sense of online processing, as each element of the sequence is observed and processed only once.
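A minimal sketch of one OEM pass is given below. It assumes that the class-conditional likelihoods of each incoming element are available as a list (one value per class); the function names and the value of *α* are hypothetical choices made for illustration.

```python
def oem_step(priors, likelihoods, alpha):
    """One E step and one M step (Equation 5.16) for a single stream element."""
    # E step: posteriors are proportional to likelihood times current prior.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    # M step: exponential forgetting between current posteriors and priors.
    new_priors = [alpha * post + (1 - alpha) * p
                  for post, p in zip(posteriors, priors)]
    return new_priors, posteriors


def oem(initial_priors, stream_likelihoods, alpha=0.05):
    """Processes each element of the stream exactly once."""
    priors = list(initial_priors)
    for likelihoods in stream_likelihoods:
        priors, _ = oem_step(priors, likelihoods, alpha)
    return priors


one_step, _ = oem_step([0.5, 0.5], [0.9, 0.1], alpha=0.1)
# A stream in which every element looks much more likely under class 0.
final_priors = oem([0.5, 0.5], [[0.9, 0.1]] * 50, alpha=0.1)
print([round(p, 3) for p in one_step], [round(p, 3) for p in final_priors])
```

Since each update is a convex combination of two probability distributions, the priors remain normalised at every step without any explicit renormalisation.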

In experiments, OEM performs better than the original EM at improving classification accuracy, yet the estimates of the priors themselves are not very accurate. Zhang and Zhou (2010) observed that this issue is likely related to a small-sample effect, i.e., that the update of the priors in Equation 5.16 is determined by a single element. They propose to overcome this issue by means of a transfer estimation method, which computes the M step using the posteriors from the *N* most recent elements in the stream, i.e., Equation 5.16 is changed into

$$
\hat{p}\_{k+1}(\mathbf{y}) = \alpha \frac{1}{N} \sum\_{l=0}^{N-1} \hat{p}\_k(\mathbf{y}|\mathbf{x}\_{k-l}) + (1 - \alpha)\hat{p}\_k(\mathbf{y}) \tag{5.17}
$$
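The windowed update of Equation 5.17 can be sketched in the same spirit. Here the posteriors of the *N* most recent elements are recomputed with the current priors and then averaged; averaging over fewer than *N* elements at the start of the stream is our own assumption, as are the function and parameter names.

```python
from collections import deque


def windowed_oem(initial_priors, stream_likelihoods, alpha=0.05, window=3):
    """Online prior update in which the M step averages the posteriors of the
    `window` most recent elements (Equation 5.17)."""
    priors = list(initial_priors)
    recent = deque(maxlen=window)  # likelihoods of the most recent elements
    for likelihoods in stream_likelihoods:
        recent.append(likelihoods)
        averaged = [0.0] * len(priors)
        for past in recent:
            joint = [p * l for p, l in zip(priors, past)]
            total = sum(joint)
            for j, value in enumerate(joint):
                averaged[j] += value / total / len(recent)
        priors = [alpha * a + (1 - alpha) * p
                  for a, p in zip(averaged, priors)]
    return priors


# With a window of one element the update reduces to Equation 5.16.
single = windowed_oem([0.5, 0.5], [[0.9, 0.1]], alpha=0.1, window=1)
smoothed = windowed_oem([0.5, 0.5],
                        [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.9, 0.1]],
                        alpha=0.1, window=3)
print([round(p, 3) for p in single], [round(p, 3) for p in smoothed])
```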

Maletzke et al. (2018) explore the use of active learning on data streams as a device to improve quantification accuracy. They define data streams as generators of instances across time. For quantification, they consider *U* to be composed of a sequence of event windows *Ut* across time. Quantification requests happen whenever an event window is complete. The true label *y* is known for an initial batch of instances that define the training set *L*. The true label for successive instances may be available after a verification latency time *Tl*, which may range from *Tl* = 0 to *Tl* = ∞. The first case means that, if requested, the true label for an instance is immediately available. This is an unrealistic case for most real-world applications, as some time is inevitably required by the labelling oracle, typically a human annotator, to produce the true labels. The latter case of *Tl* = ∞ means that no true labels will ever be available for instances outside the training set, which is an extreme scenario in which no active learning strategy can be applied. Active learning can be exploited in all the cases for which *Tl <* ∞, exploring many possible strategies and trade-offs between labelling cost and improvement in quantification accuracy.

The methods proposed by Maletzke et al. (2017, 2018) are template methods as they leverage a classification-based method to perform the actual quantification, while they manipulate the training data (transforming or enriching it).

The *Stream Quantification by Score Inspection* (SQSI) algorithm (Maletzke et al., 2017) leverages statistical tests to decide if a classifier trained on *L* can be reliably used to perform classification and quantification on *Ut* . The algorithm works as follows:

	- (a) If the null hypothesis is not rejected, a quantification method based on *h* is used to estimate class prevalence on *Ut*. The algorithm repeats from Step 2 for the successive set *Ut*+1.
	- (b) Otherwise, the algorithm makes a first attempt at transforming *L* into a shift-adapted training set *L*′ using the shift adaptation algorithm described in dos Reis et al. (2016).
	- (a) If the null hypothesis is not rejected, a quantification method based on *h* is used to estimate class prevalence on *Ut*. *L*′ replaces *L* and the algorithm repeats from Step 2 for the successive set *Ut*+1.
	- (b) If the null hypothesis is rejected again, then the true labels of *Ut* are requested from an oracle, defining a new training set *L*. The algorithm repeats from Step 1 for the successive set *Ut*+1.

Assuming a small shift between successive sets of items *Ut* and *Ut*+1, one can expect that the oracle will seldom be consulted. In the experimental evaluation of Maletzke et al. (2017), performed on fourteen datasets with a very low number of features (only two features for 8 synthetic datasets, and fewer than 100 in the other cases), the portion of items labelled by the oracle was below 10% in all but one case.

The SQSI algorithm can help the quantification process only when the observed shift is within the range of correction of the shift adaptation method, otherwise it fails, requiring a complete labelling of the set of items to be quantified by the oracle. The SQSI-IS (where IS stands for Instance Selection) algorithm tries to reduce the amount of labelling required by using instance selection and self-learning whenever the shift adaptation method fails. Instead of requiring the oracle to label the whole set *U* (Step 5b above), only a fraction of elements of *U* is selected for labelling by the oracle, while the remaining part is labelled using an iterative process of self-learning adding to *L* the element of *U* \ *L* that is classified with the highest confidence. The authors test several instance selection methods (random, clustering based, farthest-first traversal), and find that a clustering-based approach performs consistently better, with the best overall quantification performance observed for SQSI-IS instantiated with clustering and the PCC quantification method.<sup>1</sup> The observed reduction in labelling requests from SQSI to SQSI-IS is 50% on average, while achieving the same quantification performance.

<sup>1</sup> Maletzke et al. (2018) tested CC, PCC and ACC as the base quantification methods.

## **5.7 One-Class Quantification**

A one-class classification problem assumes that the labelled examples are all positive examples of a single class, and that no negative examples are available. Performing quantification in the one-class case is challenging because it is not possible to measure a real prevalence on the training set *L*. Moreover, for quantification methods that rely on classification, one-class classification is itself a harder problem than traditional classification, in which one has representative examples of both the positive and the negative classes.

Nonetheless, approaching a quantification problem as a one-class quantification problem may be a more robust approach in cases in which the definition of the negative cases is open. In a one-class setup the positive label will likely identify a specific property, while the negative label comprises the universe of data points for which such property does not hold. In this case it is thus hard to have the domain of negative examples properly represented in the training set. The domain of negative examples may change considerably after training the quantification model. For example, one may be interested in training a Sports news quantifier, having as negative examples only news about Health. The trained quantifier may then be applied to datasets that include news about Economics and Politics. In this scenario, a one-class quantifier, trained only on positive examples for Sports, may be more robust to the variation of data composition between the training phase and the deployment phase.

Moreira dos Reis et al. (2018a) propose two methods for one-class quantification: the *Passive Aggressive Threshold ACC* (PAT-ACC) method and the *One Distribution Inside* (ODIn) method, the latter of which draws inspiration from the MM approach (Forman, 2008, see Section 4.2.8). Both methods are designed to work in combination with one-class classifiers.

PAT-ACC extends ACC to work on one-class problems by observing that the problem of estimating FPR can be circumvented by choosing a conservative classification threshold, so that one can assume that FPR ≈ 0. If the classification threshold is set at the quantile *q* of the scores of the (positive) training observations, so that a fraction 1−*q* of them is classified as positive, then the TPR can be estimated as TPR = 1−*q*, which allows quantification to be performed using the ACC method (see Equation 4.5), i.e.,

$$
\hat{p}\_U^{\text{PAT}-\text{ACC}}(\oplus) = \min\left(1, \frac{p\_U(h(\oplus))}{(1-q)}\right) \tag{5.18}
$$

Moreira dos Reis et al. (2018a) claim that the PAT-ACC method is not sensitive to the value of *q* and report that a value of *q* = 0*.*25 is a generally good choice. They also suggest that an approach similar to Median Sweep can be adopted to avoid using a fixed *q* value.
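The following sketch illustrates Equation 5.18 under the assumption that the threshold is placed at the quantile *q* of the scores that the one-class classifier assigns to the positive training items, and that higher scores indicate positives. The index-based quantile, the function name, and the scores are our own illustrative choices.

```python
def pat_acc(train_pos_scores, unlabelled_scores, q=0.25):
    """Equation 5.18 with FPR assumed to be 0 and TPR estimated as 1 - q."""
    ordered = sorted(train_pos_scores)
    # Threshold at the q-th quantile of the positive training scores, so that
    # a fraction of about 1 - q of them is classified as positive.
    threshold = ordered[int(q * len(ordered))]
    predicted_pos = sum(1 for s in unlabelled_scores if s >= threshold)
    fraction_pos = predicted_pos / len(unlabelled_scores)
    return min(1.0, fraction_pos / (1 - q))


train_pos_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
unlabelled_scores = [0.05, 0.1, 0.15, 0.2, 0.22, 0.25, 0.28, 0.5, 0.6, 0.7]
estimate = pat_acc(train_pos_scores, unlabelled_scores, q=0.25)
print(round(estimate, 3))
```

Three of the ten unlabelled scores exceed the threshold, so the raw fraction 0.3 is divided by the estimated TPR of 0.75, giving 0.4.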

The ODIn method compares the score distribution that is available only for positive examples in the case of the training set *L* with the score distribution for *U*, which includes both negative and positive examples. Scores from the classification of *L* are used to define a variable-width histogram *H<sup>L</sup>* in which each bin has the same number of elements. The number of bins *b* is a parameter, which in Moreira dos Reis et al. (2018a) is set to *b* = 10. Scores from the classification of *U* define a histogram *H<sup>U</sup>* , which uses the bin definition of *HL*. The overflow of *H<sup>L</sup>* in *H<sup>U</sup>* is defined as

$$\text{OF}(\alpha, H^U, H^L) = \sum\_{l=1}^b \max(0, \alpha H\_l^L - H\_l^U) \tag{5.19}$$

The value *α* scales the histogram *H<sup>L</sup>*, and OF measures by how much the bins of the scaled histogram still exceed the corresponding bins of *H<sup>U</sup>*. Intuitively, ODIn searches for the largest value of *α* for which the scaled *H<sup>L</sup>* fits inside *H<sup>U</sup>*, and then produces the quantification estimate by correcting it for its overflow, i.e.,

$$
\hat{p}\_U^{\text{ODIn}}(\oplus) = \text{s} - \text{OF}(\text{s}, H^U, H^L) \tag{5.20}
$$

where

$$s = \sup\_{0 \le \alpha \le 1} \{ \alpha | \text{OF}(\alpha, H^U, H^L) \le \alpha \mathcal{L} \}$$

where ℒ is a parameter of the method (not to be confused with the training set *L*). In Moreira dos Reis et al. (2018a) the authors set ℒ = *μ̂*<sub>OF</sub> + *d* *σ̂*<sub>OF</sub>, where the values *μ̂*<sub>OF</sub> and *σ̂*<sub>OF</sub> are the mean and standard deviation of the OF function estimated on pairs of samples from *L*, and *d* is a new parameter that replaces ℒ. The authors claim that the parameter *d* has clearer semantics than ℒ, i.e., *d* is the number of standard deviations from the expected average overflow, and arbitrarily set *d* = 3 for all of their experiments.
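A small sketch of the ODIn estimate follows. It takes the overflow to be the mass of the scaled histogram of *L* that exceeds the histogram of *U*, bin by bin (the reading suggested by the description of OF above), and replaces the supremum with a search over a grid of values of *α*. Histograms are assumed to be normalised to sum to 1; the function names, the grid search, and the value passed as `tolerance` (which plays the role of the parameter multiplying *α* in the definition of *s*) are hypothetical.

```python
def overflow(alpha, hist_u, hist_l):
    """Mass of the scaled histogram alpha * H^L that does not fit inside H^U."""
    return sum(max(0.0, alpha * l - u) for u, l in zip(hist_u, hist_l))


def odin(hist_u, hist_l, tolerance, steps=1000):
    """Largest alpha on a grid whose overflow is at most alpha * tolerance,
    corrected by subtracting that overflow (Equation 5.20)."""
    s = 0.0
    for i in range(steps + 1):
        alpha = i / steps
        if overflow(alpha, hist_u, hist_l) <= alpha * tolerance:
            s = alpha
    return s - overflow(s, hist_u, hist_l)


# Equal-frequency bins on L give a flat histogram of positive scores.
hist_l = [0.25, 0.25, 0.25, 0.25]
# U mixes 40% positives with 60% negatives concentrated in the low-score bins.
hist_neg = [0.7, 0.3, 0.0, 0.0]
hist_u = [0.4 * l + 0.6 * n for l, n in zip(hist_l, hist_neg)]

estimate = odin(hist_u, hist_l, tolerance=0.05)
print(round(estimate, 3))
```

In this toy setting the overflow is zero up to *α* = 0.4, the true prevalence of positives, and grows linearly afterwards; the tolerance lets *α* slightly overshoot, and the final subtraction removes part of that overshoot.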

The problem of class prior estimation in the one-class case is addressed in du Plessis and Sugiyama (2014). The main goal of this work is learning a classifier from positive examples and unlabelled data, and quantification is not the subject of its proposal. Yet, the proposed method, which the authors call PE, performs the estimation of class priors, considering it a necessary step to learn a good classifier. Given that the correct estimation of class priors is indeed quantification, we consider this work relevant to our goals. They start from the input density formula

$$q(\mathbf{x}; \Theta) = \Theta p(\mathbf{x}|\Phi(\mathbf{x}) = \oplus) + (1 - \Theta)p(\mathbf{x}|\Phi(\mathbf{x}) = \ominus) \tag{5.21}$$

observing that *q*(**x**; Θ) = *p*(**x**) when Θ = *p*(⊕), thus defining a full-matching method for prior estimation. However, in the one-class case *p*(**x**|Φ(**x**) = ⊖) is unknown. To overcome this issue the authors make the assumption that the class-conditional densities *p*(**x**|Φ(**x**) = ⊕) and *p*(**x**|Φ(**x**) = ⊖) are not strongly overlapping and propose a partial-matching estimation method. This method matches only *p*(**x**|Φ(**x**) = ⊕) to *p*(**x**) using the Pearson Divergence (PD), i.e.,

$$
\hat{p}\_U^{\text{PE}}(\oplus) = \arg\min\_{\Theta} \text{PD}(\Theta) \tag{5.22}
$$

where PD is defined as

$$\text{PD}(\Theta) = \frac{1}{2} \int \left( \frac{\Theta p(\mathbf{x} | \Phi(\mathbf{x}) = \oplus)}{p(\mathbf{x})} - 1 \right)^2 p(\mathbf{x}) d\mathbf{x} \tag{5.23}$$

The authors showed experimentally that the partial-matching method based on PD has a lower error than the method based on Equation 5.21 for the one-class case. In a subsequent work (du Plessis et al., 2017) the approach is further extended to other divergence functions.
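The behaviour of Equations 5.22 and 5.23 can be checked on a toy example in which the integral is replaced by a sum over a few discrete bins and both distributions are assumed to be known exactly. This is only an illustration of the objective function, not the estimator of du Plessis and Sugiyama (2014), which works from samples; the function names and the grid search are our own assumptions.

```python
def pearson_divergence(theta, p_pos, p_all):
    """Discrete analogue of Equation 5.23 over bins where p_all is positive."""
    return 0.5 * sum(
        (theta * pp / pa - 1.0) ** 2 * pa
        for pp, pa in zip(p_pos, p_all)
        if pa > 0
    )


def pe_estimate(p_pos, p_all, steps=1000):
    """Equation 5.22 solved by searching a grid of candidate values in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda theta: pearson_divergence(theta, p_pos, p_all))


# Class-conditional distributions with disjoint supports (no overlap at all).
p_pos = [0.5, 0.5, 0.0, 0.0]
p_neg = [0.0, 0.0, 0.5, 0.5]
true_prior = 0.3
p_all = [true_prior * a + (1 - true_prior) * b for a, b in zip(p_pos, p_neg)]

estimate = pe_estimate(p_pos, p_all)
print(estimate)
```

When the supports are disjoint the minimiser coincides with the true prior; the more the two class-conditional distributions overlap, the more the minimiser departs from it, which is why the non-overlap assumption matters.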

Zeiberg et al. (2020) proposed the DistCurve algorithm, which estimates the prevalence of a sample *σ* by leveraging the concept of distance curve. A distance curve is computed starting from a sample *σ* and a labelled set *L* that contains only positive elements. Points of the curve are determined by sampling, with replacement, a random element from *L*, and measuring its distance from the closest element in *σ*; that element is then removed from *σ*. The procedure continues until *σ* is empty. The idea is that the distance curve should show a steep increase in distance at step *pσ (*⊕*)*|*σ*|, as all the positive elements have by then been removed from the set. A neural network is trained on distance curves generated on samples with known priors, so as to be able to predict the value *p̂σ* from the distance curve for *σ*. In order to be robust to the statistical variation caused by the sampling mechanism, the distance curve for *σ* that is given as input to the neural network is determined as the average of multiple runs of the method that computes the distance curve.
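The construction of the distance curve (but not the neural network that reads it) can be sketched as follows. For simplicity we assume one-dimensional items and use the absolute difference as the distance; these choices, the function names, and the data are hypothetical.

```python
import random


def distance_curve(positives, sample, rng):
    """One run: repeatedly draw a positive (with replacement), record the
    distance to its nearest remaining sample element, and remove that element."""
    remaining = list(sample)
    curve = []
    while remaining:
        anchor = rng.choice(positives)
        nearest = min(range(len(remaining)),
                      key=lambda j: abs(remaining[j] - anchor))
        curve.append(abs(remaining[nearest] - anchor))
        remaining.pop(nearest)
    return curve


def averaged_distance_curve(positives, sample, runs=20, seed=0):
    """Pointwise average of several runs, to smooth out sampling noise."""
    rng = random.Random(seed)
    curves = [distance_curve(positives, sample, rng) for _ in range(runs)]
    return [sum(values) / runs for values in zip(*curves)]


positives = [0.0, 0.1, 0.2, 0.3]
# Three items resembling the positives and four items far away from them.
sample = [0.05, 0.15, 0.25, 5.0, 5.1, 5.2, 5.3]
curve = averaged_distance_curve(positives, sample)
print([round(d, 2) for d in curve])
```

In this toy sample the curve stays small for the first three steps and then jumps, which corresponds to a prevalence of 3/7 in the sample.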

## **5.8 Confidence Intervals for Class Prevalence Estimates**

A *confidence interval* (CI), in the context of quantification, is a range of values *(l, h)* which should contain the true prior probability *pU (y)* for class *y* with a desired level of confidence, such as 95%. In mathematical terms, *l* and *h* should be such that the probability of event *(pU (y)* ∈ *(l, h))* is equal to 0.95. This information is often more useful than a point estimate of class prevalence *p*ˆ*<sup>U</sup> (y)*.

Hopkins and King (2010) first mentioned computing bootstrapped CIs for their estimates, without providing much detail. CIs for quantification have received more attention in recent years. Keith and O'Connor (2018) propose a generative model, whose characteristics naturally allow for the computation of CIs for class prevalence values (Section 4.2.8). Let *pU (*⊕*)* denote the true proportion of positives in *U*. Algorithms which support *Maximum a posteriori* estimation are typically used to compute the single most plausible value for *pU (*⊕*)*, i.e., the one that is most compatible with the covariates observed in *U*, but also support the computation of likelihood values for any possible *pU (*⊕*)* ∈ [0*,* 1]. The authors exploit this idea, training different versions of the generative models. At inference time, they employ grid search over all possible (quantised) values of *pU (*⊕*)*, in conjunction with a uniform prior, constructing a posterior density from which confidence intervals are derived.
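The grid-based construction can be illustrated with a generic two-component mixture likelihood standing in for the authors' generative model, whose details we do not reproduce here. The sketch assumes that, for each unlabelled item, the likelihoods under the positive and the negative class are given; all names and values are hypothetical.

```python
from math import exp, log


def grid_posterior(likelihood_pairs, grid_size=101):
    """Posterior over the prevalence on a uniform grid, with a uniform prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for theta in grid:
        total = 0.0
        for l_pos, l_neg in likelihood_pairs:
            mix = theta * l_pos + (1 - theta) * l_neg
            total += log(mix) if mix > 0 else float("-inf")
        log_post.append(total)
    peak = max(log_post)  # subtracted before exponentiating, for stability
    weights = [exp(value - peak) for value in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def interval_from_posterior(grid, posterior, level=0.95):
    """Equal-tailed interval read off the cumulative posterior mass."""
    tail = (1 - level) / 2
    cumulative, low, high, found_low = 0.0, grid[0], grid[-1], False
    for theta, mass in zip(grid, posterior):
        cumulative += mass
        if not found_low and cumulative >= tail:
            low, found_low = theta, True
        if cumulative >= 1 - tail:
            high = theta
            break
    return low, high


# 30 items that look positive and 70 that look negative.
pairs = [(0.9, 0.1)] * 30 + [(0.1, 0.9)] * 70
grid, posterior = grid_posterior(pairs)
map_estimate = grid[posterior.index(max(posterior))]
low, high = interval_from_posterior(grid, posterior)
print(map_estimate, low, high)
```

Note that the most plausible prevalence here is 0.25 rather than the 0.30 suggested by counting the items that look positive, because the mixture likelihood accounts for the fact that individual items are only imperfectly separable.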

Daughton and Paul (2019) propose a technique called *error-adjusted bootstrap* to compute CIs for quantification based on the outputs of a classifier, with a correction procedure accounting for its (im)precision. In the construction of a bootstrap sample, they draw an instance with covariates **x** from *U*, and feed it to a classifier *h(***x***)*, to obtain a predicted class *c* ∈ {⊕*,* ⊖}. The bootstrap sample is expanded by using the classifier output as a parameter to sample from a Bernoulli distribution with success probability *pU (*⊕|*h(***x***)* = *c)*; successful (unsuccessful) draws result in attaching class ⊕ (⊖) to the sample. Prevalence estimates for a single bootstrap sample are subsequently obtained by computing the frequency of ⊕ within it. Confidence intervals at a desired level are then constructed customarily, based on the estimates from all bootstrap samples. Crucially, the precision-related parameter *pU (*⊕|*h(***x***)* = *c)*, shaping the Bernoulli distribution, is estimated on the training sample *L*. As duly noted by Tasche (2019), this approach does not generally work under dataset shift. This is due to the fact that *pU (*⊕|*h(***x***)* = *c)* = *pL(*⊕|*h(***x***)* = *c)* is not guaranteed to hold. Hence, the approach of Daughton and Paul (2019) seems suited to handle covariate shift, a setting where the previous equation holds true.
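The resampling scheme described above can be sketched as follows. The percentile construction of the interval, the function name, and the probabilities in `p_pos_given_pred` (which in practice would be estimated on *L*) are assumptions made for illustration.

```python
import random


def error_adjusted_bootstrap(predictions, p_pos_given_pred,
                             n_boot=500, level=0.95, seed=0):
    """Percentile interval for the prevalence of the positive class.

    predictions: predicted class of each item in U.
    p_pos_given_pred: probability that an item is truly positive given its
        predicted class (estimated on the training set).
    """
    rng = random.Random(seed)
    n = len(predictions)
    estimates = []
    for _ in range(n_boot):
        positives = 0
        for _ in range(n):
            predicted = rng.choice(predictions)  # resample an item from U
            # Bernoulli draw governed by the classifier's (im)precision.
            if rng.random() < p_pos_given_pred[predicted]:
                positives += 1
        estimates.append(positives / n)
    estimates.sort()
    low = estimates[int((1 - level) / 2 * n_boot)]
    high = estimates[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return low, high


# 30 items predicted positive (1) and 70 predicted negative (0).
predictions = [1] * 30 + [0] * 70
p_pos_given_pred = {1: 0.8, 0: 0.1}
low, high = error_adjusted_bootstrap(predictions, p_pos_given_pred)
print(low, high)
```

The interval is centred near 0.3 · 0.8 + 0.7 · 0.1 = 0.31 rather than near the raw fraction of predicted positives, which is the effect of the error adjustment.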

Fernandes Vaz et al. (2019), whose work is discussed in Section 4.2.7, provide a central limit theorem for the ratio estimator, from which confidence intervals can be computed without any numerical simulation.

Tasche (2019) deploys a simulation study to shed some light on the topic of CIs in quantification tasks, under prior probability shift. Despite lacking the complexity of real-world datasets, the study provides some illustrative and interesting results in a controlled setting described very clearly. Several quantification methods are selected based on Fisher-consistency (Tasche, 2017) and popularity in the literature, including ACC (Section 4.2.3), PACC (Section 4.2.4), MS (Section 4.2.6), HDy (Section 4.2.8). Each of these methods is tested in a variety of settings, with probability shift ranging from strong to mild, exploiting underlying classifiers of variable discriminatory power, and testing on unlabelled samples of size |*U*| ∈ {50*,* 500}. For each combination of the above parameters, CIs at 90% are constructed via regular bootstrapping. One key finding is that, if a quantification method is based on an underlying classifier with high power, then the CIs will be shorter and more informative while retaining desired coverage levels.

The study also points out that, for quantification problems, prediction intervals are, in principle, more useful than confidence intervals. Indeed, a practitioner is not exactly interested in having a range for the true prior probability from which the unlabelled sample *U* originated, i.e., the target of confidence intervals. Rather, they plausibly care about having a range of plausible values for the *realised prevalence*, i.e. the percentage of points from *U* that belong to the positive class, a quantity that should be targeted by (more conservative) prediction intervals. However, the results of simulations carried out by Tasche (2019) in a variety of settings suggest that, for |*U*| *>* 50, as reasonable in most practical applications, the construction of confidence intervals is sufficient (adequate coverage) and there seems to be no need for the construction of more conservative prediction intervals.

Thanks to central limit theorems (see e.g., Section 4.2.7), confidence intervals for some approaches can be constructed without bootstrapping. Tasche (2019) also tests the effectiveness of this approach, concluding that it yields suboptimal results (e.g., low coverage) under certain conditions. As an example, if the true positive rate and false positive rate of an underlying classifier have to be estimated, a limited sample size for *L* may be a source of imprecision in these estimates, corrupting prevalence estimates and bringing about confidence intervals of insufficient width.

More recently, Denham et al. (2021) note that PCC can natively provide confidence intervals, since PCC may be thought of as computing the mean of a Poisson binomial distribution of the posterior probabilities (scaled by a constant factor), and since we know how to derive reliable confidence intervals under this assumption. The authors exploit this idea, along with other assumptions on the underlying distributions of a mixture model, to derive confidence intervals for their method GSLS (explained in Section 4.2.8).
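The Poisson binomial view of PCC can be illustrated with a short sketch. The count of positives is treated as a sum of independent Bernoulli variables whose success probabilities are the posteriors, so that its mean and variance are available in closed form; the normal approximation used to turn them into an interval is our own simplifying assumption and not necessarily the construction used by Denham et al. (2021).

```python
from math import sqrt
from statistics import NormalDist


def pcc_interval(posteriors, level=0.95):
    """PCC estimate with a normal-approximation interval derived from the
    mean and variance of a Poisson binomial distribution."""
    n = len(posteriors)
    mean_count = sum(posteriors)
    var_count = sum(p * (1 - p) for p in posteriors)
    z = NormalDist().inv_cdf((1 + level) / 2)
    estimate = mean_count / n
    half_width = z * sqrt(var_count) / n
    return estimate, max(0.0, estimate - half_width), min(1.0, estimate + half_width)


# 20 items with posterior 0.9 and 80 items with posterior 0.1.
posteriors = [0.9] * 20 + [0.1] * 80
estimate, low, high = pcc_interval(posteriors)
print(round(estimate, 3), round(low, 3), round(high, 3))
```

Such an interval is only as trustworthy as the posteriors themselves: if they are poorly calibrated on *U*, the interval can be narrow and still miss the true prevalence.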

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 The Quantification Landscape**

# **6.1 Historical Development**

## *6.1.1 The Trajectory of Quantification*

The "prehistory" of quantification research may be traced to the interest in the estimation of class prevalence from screening tests, as carried out in epidemiology. Accordingly, the first recorded "quantification" technique is probably the one of Gart and Buck (1966) (see Section 4.2.3). This literature is different from that discussed in the rest of this book (and this is the reason why the term "quantification" above is in quotation marks) since no training data (and no supervised learning) is involved here: the role of the classifier is here played by a clinical test that has imperfect (but known) sensitivity and specificity (see Section 6.4 for details). The estimation of class prevalence has remained an important concern of epidemiological research, and several papers on this topic (e.g., Levy and Kass, 1970; Lew and Levy, 1989; Morvan et al., 2008; Rahme and Joseph, 1998; Viana et al., 1993; Zhou et al., 2002) have continued to appear in epidemiology-related journals to this day.

The first stage of such history in which supervised learning is involved coincides with interest in the estimation of class prevalence from the machine learning community, where the goal is (as already discussed in Section 2.1) that of building classifiers that are robust to the presence of distribution shift, and that are better attuned to the characteristics of the data to which they need to be applied. Here, the precursors seem to have been Vucetic and Obradovic (2001), but the most influential paper to date in this field is certainly that of Saerens et al. (2002); later works are, e.g., Alaíz-Rodríguez et al. (2011), Chan and Ng (2005), Chan and Ng (2006), Xue and Weiss (2009), and Zhang and Zhou (2010). As mentioned in Section 2.1, in this stream of research the estimated class prevalence values are not interesting *per se*, but only serve the purpose of allowing a better estimation of the posterior probabilities *p(y*|**x***)* (and hence a more accurate classification) for unlabelled data in contexts characterised by significant distribution shift.

The second and last stage of such history coincides with interest from data mining, text mining, and content analysis; it is mainly the applications from these fields that have provided the impetus behind the most recent wave of research in quantification.

Some 10 years before the very first developments in this line of research, Lewis (1995, §7) had already evoked a task (that he called *counting*) that was to consist of simply counting the unlabelled items that belonged to a given class (which, once the counts are normalised by the total number of unlabelled items, coincides with quantification). In that paper Lewis observed that "if our goal is to count class members, and if we have estimates of the probability of class membership, we should use the estimates directly to estimate the number of class members, rather than use them to classify documents"; this is exactly the principle that Bella et al. (2010), unaware of Lewis' observation 15 years earlier, based their PCC method upon (see Section 4.2.2). In this work, Lewis briefly discussed a potential evaluation measure for "counting" (which consisted of the square of the differences between FP and FN), but did not discuss the task in any further detail. His remarks about "counting" went essentially unnoticed, and quantification had to wait another 10 years in order for someone to call attention to the need to study it as a task separate from classification.
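The contrast that Lewis draws can be made concrete with a small sketch that compares counting hard decisions with summing the probability estimates directly. The membership probabilities below are hypothetical, and the two helper functions are illustrative names of ours rather than code from the cited papers.

```python
def classify_and_count(posteriors, threshold=0.5):
    """Count the items whose probability of membership reaches the threshold."""
    return sum(1 for p in posteriors if p >= threshold)


def probabilistic_count(posteriors):
    """Expected number of class members: the sum of the membership probabilities."""
    return sum(posteriors)


# Hypothetical membership probabilities for six unlabelled documents.
posteriors = [0.9, 0.8, 0.4, 0.4, 0.3, 0.2]
n = len(posteriors)
cc_prevalence = classify_and_count(posteriors) / n    # uses hard decisions
pcc_prevalence = probabilistic_count(posteriors) / n  # uses the estimates directly
```

Here the hard decisions yield a prevalence of one third, while the probability estimates sum to three expected members out of six; the information carried by the items that fall below the threshold is lost in the first case and retained in the second.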

This finally happened with (Forman, 2005) and the papers by the same author that soon followed (Forman, 2006, 2008; Forman et al., 2006); it is in these papers that the term "quantification" was coined, a term that has since stuck and become standard terminology. Contrary to the works mentioned above (re: "first stage of such history"), in these works the estimated class prevalence values are the true objects of interest. These works eventually became well-known among, and inspired, researchers in machine learning, data mining, and text mining to develop the new methods and algorithms that we have discussed in Sections 4 and 5.

There is one chapter in the history of quantification research that has yet to be written, though, i.e., the one on a widespread uptake of quantification technology by users, that unfortunately has yet to happen. One only needs to look at the proceedings of, say, recent computational social science conferences, to realise how many works are carried out where classification is used despite the fact that the investigators are only interested in results at the aggregate level. Undoubtedly, this has to do with a scarce awareness, on the part of data scientists, that prevalence estimation is not just a by-product of classification. It is a goal of this book to improve this awareness.

## *6.1.2 Shared Tasks*

To the best of our knowledge, the only shared task that has gathered researchers on a challenge that explicitly addressed quantification is the "Sentiment Analysis in Twitter" task of the SemEval-2016 (Nakov et al., 2016) and SemEval-2017 (Nakov et al., 2017) evaluation campaigns. The general goal of this task was to evaluate algorithms that classify tweets by sentiment. In both 2016 and 2017, this task included a binary quantification subtask (where positive vs. negative attitudes towards the designated object had to be identified) and an ordinal quantification subtask (where these attitudes had to be graded on an ordered scale of five values). That a shared task devoted to sentiment classification in Twitter should include subtasks devoted to quantification is just natural, given the fact that (as already mentioned in Section 2.3) most researchers and practitioners who apply sentiment classification technology to Twitter datasets are essentially interested in aggregate results.

One fairly disappointing result of those subtasks was that most participants used Classify and Count solutions, albeit often based on some sophisticated sentiment classification technology using deep learning. This testifies to the fact that, despite its many potential applications, quantification is still a fairly unknown task, and that there is very little awareness that Classify and Count delivers suboptimal quantification accuracy.

An ongoing challenge at the time of writing this book is the LeQua 2022 lab on Learning to Quantify (Esuli et al., 2022). The challenge brings *textual* quantification into focus, and comes with two separate tasks: "T1" for binary quantification, and "T2" for single-label quantification. Each of the tasks admits two variants, one in which documents come in the form of dense vectors, and another where documents come in raw form. The datasets consist of product reviews from the Amazon website. Task "T1" consists of predicting the binary class prevalence of the sentiment polarity of the reviews, while task "T2" consists instead of predicting the class prevalence of the merchandise categories ("Automotive", "Baby", "Beauty", ...) of the products, for a total of 28 categories. While the training samples reflect the natural prevalence observed on the Amazon website, the validation and test samples are generated following artificial prevalence values, according to the Kraemer sampling algorithm discussed in Section 3.4. The results of the challenge will be presented at the CLEF 2022 conference.
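As a rough illustration of how artificial prevalence vectors can be drawn for a single-label task, the sketch below uses the sorted-uniforms construction for sampling uniformly from the unit simplex. We assume here that this is what the Kraemer algorithm amounts to; Section 3.4 remains the authoritative description, and the function name is ours.

```python
import random


def uniform_prevalence_vector(n_classes, rng):
    """Draw a prevalence vector uniformly at random from the unit simplex.

    Sketch of the sorted-uniforms construction: n_classes - 1 cut points
    in [0, 1] split the interval into n_classes segments, whose lengths
    are taken as the prevalence values.
    """
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    bounds = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]


rng = random.Random(0)  # fixed seed, for reproducibility only
prevalences = uniform_prevalence_vector(28, rng)  # 28 classes, as in task "T2"
```

A sample of a given size would then be drawn so that its class proportions match the generated vector as closely as possible.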

## **6.2 Software**

## *6.2.1 Publicly Available Implementations*

Throughout the second phase of the history of quantification (Section 6.1), especially in recent years, several works have been published that make software implementations public, thus favouring the reproducibility and, more broadly, the adoption of quantification techniques. Indeed, publishing a software implementation of a method proposed in a paper brings many benefits to research, e.g., it provides a reference implementation, it allows peers to replicate the experiments, and it facilitates the comparison of the method with others in lab experiments. Some of the authors who have published papers on quantification methods have also published software implementing their methods and, sometimes, the methods they used as baselines. Table 6.1 reports on the available implementations of quantification methods, the papers where the link to the implementation is to be found, and the sections of this book in which the method is discussed.

## *6.2.2 QuaPy: A Comprehensive Framework for Quantification*

The last of the packages in Table 6.1 is our own. It is called QuaPy (Moreo et al., 2021a), and was originally conceived as supplementary material accompanying this book. As such, it provides implementations of the main concepts discussed here, using the same "jargon". Differently from other existing packages, QuaPy is not only a suite of methods, but an ecosystem for quantification, catering for model evaluation (including implementations of the most important evaluation measures), model selection (targeting quantification-oriented loss functions), and visualisation tools for analysing the experimental results (some examples are shown in Section 6.3.2). QuaPy also provides access to commonly used datasets, and implements a common interface to allow using other datasets. It is a Python-based open-source package with a BSD-3-Clause licence that can be directly installed via pip.<sup>1</sup> It is extensible and in constant evolution, so that anyone can contribute new material via GitHub.<sup>2</sup>

Figure 6.1 shows a complete example of QuaPy's usage. In this example, the IMDb dataset of movie reviews is fetched (it is downloaded the first time) and vectorised using TFIDF weights. The example goes on by training a PACC quantifier that uses Logistic Regression as the probabilistic classifier. The quantifier hyperparameters (*C* and *class\_weight* in this case, all coming from the classifier) are optimised via grid search using the artificial prevalence protocol for generating a maximum of 100 validation samples of 500 data items each (as indicated by *eval\_budget* and by the environment variable *SAMPLE\_SIZE*, respectively) out of a 25% held-out validation set and in terms of mean absolute error. The model is refitted on the entire training set once the hyperparameters have been optimised. Model training is then followed by model evaluation, by applying the artificial prevalence protocol anew, this time on the test set. The evaluation routine used in this example is one that generates a Pandas dataframe containing the error figures for absolute error, relative absolute error, and Kullback-Leibler divergence (see Figure 6.2).

<sup>1</sup> https://pypi.org/project/QuaPy/

<sup>2</sup> https://github.com/HLT-ISTI/QuaPy


**Table 6.1** Software packages implementing quantification methods. **Boldface** indicates the main method proposed by the paper where the link to the software is to be found. The "Section" column indicates where the main method is discussed in this book. The lower block of the table lists software packages that are not directly linked to a specific method.



```
import quapy as qp
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

# setting this environment variable allows some
# error metrics (e.g., mrae) to be smoothed
qp.environ["SAMPLE_SIZE"] = 500

dataset = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5)

# model selection with the APP
model = qp.model_selection.GridSearchQ(
    model=PACC(LogisticRegression()),
    param_grid={
        'C': np.logspace(-4, 5, 10),
        'class_weight': ['balanced', None]
    },
    protocol='app',
    eval_budget=100,
    error='mae',
    refit=True,  # retrain on the whole labelled set once done
    val_split=0.25,
).fit(dataset.training)

df = qp.evaluation.artificial_prevalence_report(
    model,  # the quantification method
    dataset.test,  # the set on which the method will be evaluated
    n_prevpoints=101,  # i.e., using the grid [0.,.01,.02,...,.99,1.]
    n_jobs=-1,  # the number of parallel workers (-1 for all CPUs)
    random_seed=42,  # allows replicating test samples across runs
    error_metrics=['ae', 'rae', 'kld'])  # evaluation metrics

print(f'best hyper-params={model.best_params_}')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 100)
print(df)
```

**Fig. 6.1** Code example using QuaPy (version 0.1.6).

```
best hyper-params={'C': 100.0, 'class_weight': 'balanced'}
 true-prev estim-prev ae rae kld
0 [0.0, 1.0] [0.057592, 0.942407] 0.057592 28.824875 0.055245
1 [0.01, 0.99] [0.034542, 0.965457] 0.024542 1.127931 0.011950
2 [0.02, 0.98] [0.039174, 0.960825] 0.019175 0.466312 0.005742
3 [0.03, 0.97] [0.035338, 0.964661] 0.005339 0.088854 0.000428
4 [0.04, 0.96] [0.081784, 0.918215] 0.041784 0.531303 0.013911
.. ... ... ... ... ...
96 [0.96, 0.04] [0.948444, 0.051555] 0.011556 0.146937 0.001445
97 [0.97, 0.03] [0.972371, 0.027628] 0.002372 0.039477 0.000099
98 [0.98, 0.02] [0.967576, 0.032423] 0.012423 0.302125 0.002743
99 [0.99, 0.01] [0.967542, 0.032457] 0.022458 1.032138 0.010480
100 [1.0, 0.0] [0.996870, 0.003129] 0.003129 1.566181 0.001716
```
**Fig. 6.2** QuaPy's output example (version 0.1.6).

## **6.3 How Do Different Quantification Methods Fare?**

## *6.3.1 A Tour of Experimental Results*

In this section we show some of the most important quantification systems in action. This set of experiments is not intended to be exhaustive, nor is it intended to make conclusive statements about the relative merits of the different quantification systems being tested. The aim of this experimentation is rather that of demonstrating some of the major performance trends that typically arise naturally in different experimental settings. A more comprehensive overview and understanding of the relative merits of the different quantification systems might only be obtained by analysing the experimental evaluation carried out by multiple teams; see, e.g., Moreo and Sebastiani (2022), Pérez-Gállego et al. (2019), and Schumacher et al. (2021). The experiments we report here are extracted from Moreo et al. (2021a) and are obtained using the QuaPy framework.<sup>3</sup>

As learning methods we chose CC (Section 4.2.1), PCC (Section 4.2.2), ACC (Section 4.2.3), PACC (Section 4.2.4), Forman's variants MAX (Section 4.2.5), MS and MS2 (Section 4.2.6), the mixture model HDy (Section 4.2.8), the expectation-maximisation-based SLD method (Section 4.2.9), SVM(AE) (Section 4.3.1) as the representative of the "explicit loss minimisation" family (minimising the same evaluation metric we use here), and E(HDy)<sub>DS</sub> as the representative of ensemble methods (Section 4.2.11); for the latter, we set the number of base quantifiers to 30 and the number of members to be selected dynamically to 15 (we perform model selection independently for each base member).

<sup>3</sup> The code to replicate all these experiments, and to generate the corresponding tables and plots, can be accessed via GitHub. See the files uci\_experiments.py (runs all experiments), uci\_tables.py (generates Table 6.2 directly in LaTeX), and uci\_plots.py (generates the plots of Figures 6.3, 6.4, 6.5, 6.6) included in the folder wiki\_examples/ of the repository https://github.com/HLT-ISTI/QuaPy.wiki.git

**Table 6.2** (continued)

**Fig. 6.3** Diagonal plot.

**Fig. 6.4** Error-by-Shift plot.

**Fig. 6.5** Global Bias-Box plot.

**Fig. 6.6** Local-Bias-Box plot with 5 bins.

The evaluation benchmark consists of 30 binary datasets coming from the UCI Machine Learning repository, as previously used by Pérez-Gállego et al. (2017). Results are mean AE scores (Section 3.1.3) obtained via 5-fold cross-validation. For each test fold, we follow the APP (Section 3.4.2) and generate 100 different random samples of 100 instances each, using a grid of prevalence values $\{0.00, 0.05, \ldots, 0.95, 1.00\}$. The hyperparameters of the quantifiers are optimised via model selection for quantification (Section 3.5); in this case, we minimise the mean AE score obtained with the APP on a stratified validation split consisting of 40% of the training set. The model, with optimised hyperparameters, is re-fit on the whole training set before estimating the test prevalence values. Except for SVM(AE), which natively uses SVMperf (Joachims, 2005), all other quantifiers rely on Logistic Regression (LR) as the underlying classifier. We explore the regularisation parameter $C$ (common to LR and SVM) in $\{10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}\}$, and the parameter class\_weight (only for LR) in {"balanced", "not balanced"}.
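The sample-generation step of the APP used in these experiments can be sketched as follows for a binary problem. This is a simplified illustration under assumptions of ours (sampling with replacement, a fixed number of repetitions per grid point, hypothetical function and parameter names), and is not the routine that QuaPy actually runs.

```python
import random


def app_samples(labels, prevalence_grid, sample_size, repeats, rng):
    """Yield (target_prevalence, sample_indices) pairs for a binary problem.

    For each prevalence value in the grid, `repeats` samples are drawn
    (with replacement) so that the positive class has that prevalence.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for prev in prevalence_grid:
        n_pos = round(prev * sample_size)
        for _ in range(repeats):
            sample = rng.choices(positives, k=n_pos)
            sample += rng.choices(negatives, k=sample_size - n_pos)
            yield prev, sample


rng = random.Random(42)
labels = [1] * 300 + [0] * 700      # toy test fold with 30% positives
grid = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 0.95, 1.00
samples = list(app_samples(labels, grid, sample_size=100, repeats=5, rng=rng))
```

Each generated sample is then passed to the quantifier, and the error between the target prevalence and the estimated one is averaged across all samples.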

The results reported in Table 6.2 are fairly consistent with other results previously reported in the literature (Moreo and Sebastiani, 2021, 2022; Pérez-Gállego et al., 2019; Schumacher et al., 2021). They clearly indicate that the quantifier SLD behaves very well overall (in this case beating all other methods in 13 datasets out of 30). Methods E(HDy)<sub>DS</sub> (8 times best method), PACC (4 times best method), and (to a lesser extent) ACC (2 times best method), also fare very well, obtaining average ranks not statistically significantly different from the best average rank obtained by SLD. The method SVM(AE) tends to produce results that are markedly worse than those of its competitors. In line with the observations of Schumacher et al. (2021), none of the variants MAX, MS, MS2 improve over ACC. Also in line with the findings of Pérez-Gállego et al. (2019), the ensemble E(HDy)<sub>DS</sub> clearly outperforms the base quantifier HDy it is built upon. A general trend that emerges in this experimentation, and that is consistent with almost all (if not all) other reported experiments, concerns the fact that performing classification alone (as done by, e.g., CC, PCC, SVM(AE)) does not suffice for providing accurate estimations of class prevalence values in situations of distribution shift; in such situations one typically needs to perform some sort of adjustment to the prevalence estimation derived from the use of a (biased) classifier.

## *6.3.2 Visualisation Tools for the Analysis of Results*

While averaged error scores do certainly speak clearly about the macro behaviour of quantification systems, they do not tell the entire story. The analysis of results can sometimes be complemented with the aid of visualisation tools that can help to unravel how a system performs in specific experimental conditions. This is especially useful in scenarios in which the practitioner wants to better understand how the system fares, say, in the presence of high/low shift, or in regions of high/low prevalence. Complementing the analysis with such additional viewpoints is interesting since, for reasons discussed in Section 3.4.4, some protocols are sometimes criticised for involving testing conditions that some practitioners might deem unlikely to occur in real cases. In what follows, we discuss some useful types of plots that can be helpful in practical scenarios.

One plot which is of particular relevance for the analysis of binary quantifiers is the so-called "diagonal plot". This plot displays the predicted prevalence values along the y-axis against the true prevalence values along the x-axis; predicted values are sometimes binned according to the true prevalence values. The plot is called "diagonal" since the ideal quantifier is described by a diagonal line, from coordinates (0,0) to (1,1). An example of this plot, computed on the same batch of experiments reported in Table 6.2, is offered in Figure 6.3 (we also showed some examples in Section 1.6). This type of plot allows one to rapidly grasp intuitions about the tendency of a quantification method to systematically overestimate or underestimate the true class prevalence. In this example, the plot reveals that, for high prevalence values of the Positive class, SLD tends to slightly overestimate the class prevalence values, while most other methods tend instead to underestimate them. For low prevalence values of the Positive class, methods MAX, MS, MS2, PCC, and CC show a tendency to overestimate these prevalence values.
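The data behind one curve of a diagonal plot can be computed with a few lines of code. The sketch below, in which the function name and the toy values are ours, averages the estimated prevalence values that share the same true prevalence; plotting the resulting points against the (0,0) to (1,1) diagonal is left to any plotting library.

```python
from collections import defaultdict
from statistics import mean


def diagonal_points(true_prevs, estim_prevs):
    """Average the estimated prevalence values that share a true prevalence.

    Returns a sorted list of (true_prevalence, mean_estimate) pairs, i.e.,
    the points of one curve in a diagonal plot.
    """
    groups = defaultdict(list)
    for true_p, estim_p in zip(true_prevs, estim_prevs):
        groups[true_p].append(estim_p)
    return sorted((true_p, mean(estimates)) for true_p, estimates in groups.items())


# Hypothetical results for the Positive class of one quantifier.
true_prevs = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
estim_prevs = [0.06, 0.10, 0.48, 0.50, 0.90, 0.94]
curve = diagonal_points(true_prevs, estim_prevs)
```

Points lying above the diagonal indicate overestimation of the class prevalence, and points lying below it indicate underestimation; in this toy case the hypothetical quantifier overestimates at low prevalence and underestimates at high prevalence.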

This plot is sometimes enriched by error bars, or colour bands around the averaged results, representing the deviations from the average. It is, however, sometimes cumbersome to plot all this information in a single plot, with many graphical elements inevitably ending up superimposed on each other. Not displaying them might however lead to misleading conclusions, since a method displaying high variance could nonetheless seem to perform very well, if one only looks at the averages of its predictions, whenever the estimator is an unbiased one. (In this example, we have opted for omitting them for the sake of clarity, but some examples of diagonal plots including colour bands can be found in Esuli et al. (2018), or in Figures 1.5 and 1.6 of this book.) Yet another limitation of this kind of plot is that it is reserved for binary problems only. While it is true that one could display a dedicated diagonal plot for each of the classes in an SLQ problem, it is no less true that the intuitions one gains by inspecting diagonal plots get blurred as the number of classes increases.

Another type of plot that does not present this limitation is what we might call the "Error-by-Shift plot". This plot displays any target error metric (say, AE) along the y-axis as a function of the distribution shift between the training set and each of the test samples, on the x-axis. As for the diagonal plot, one typically displays averaged values across bins; here too, error bars or colour bands might help to reveal the system variance provided that the number of visual elements is moderate. Since this plot works with the concept of "shift" (as implemented in terms of any error or divergence metric), it can be applied to any problem characterised by any number of classes. Figure 6.4 shows an example for the experiments of Table 6.2. Note that the errors on the left-hand side of the plot correspond to situations in which the test and the training prevalence are close to each other, while errors on the right-hand side of the plot describe how the systems perform in cases of high shift. In this particular example, the plot reveals how E(HDy)<sub>DS</sub> excels in situations characterised by low distribution shift, while SLD seems the most robust in dealing with high-shift scenarios. This example consists of averages across 30 datasets, and for many of them there are few cases of very high shift, or none at all; this explains why the curves look less stable in the right-most part of the plot.
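A minimal sketch of how the points of an Error-by-Shift curve might be obtained is given below. It assumes that shift is measured as the AE between the training prevalence and the true prevalence of each test sample, and that equal-width bins are used; both choices, as well as the function names, are ours.

```python
from statistics import mean


def absolute_error(prev_a, prev_b):
    """Mean absolute difference between two prevalence vectors."""
    return mean(abs(a - b) for a, b in zip(prev_a, prev_b))


def error_by_shift(train_prev, true_prevs, estim_prevs, n_bins=5, max_shift=1.0):
    """Average the quantification error within equal-width bins of shift.

    Shift is measured here as the AE between the training prevalence and
    the true prevalence of each test sample; returns {bin_index: mean_error}.
    """
    binned = {}
    for true_p, estim_p in zip(true_prevs, estim_prevs):
        shift = absolute_error(train_prev, true_p)
        bin_index = min(int(shift / max_shift * n_bins), n_bins - 1)
        binned.setdefault(bin_index, []).append(absolute_error(true_p, estim_p))
    return {b: mean(errors) for b, errors in sorted(binned.items())}


# Hypothetical binary results: two low-shift samples and one high-shift sample.
train_prev = [0.5, 0.5]
true_prevs = [[0.50, 0.50], [0.55, 0.45], [0.85, 0.15]]
estim_prevs = [[0.52, 0.48], [0.51, 0.49], [0.75, 0.25]]
curve = error_by_shift(train_prev, true_prevs, estim_prevs)
```

Since prevalence vectors of any length can be passed in, the same computation applies unchanged to problems with more than two classes.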

As mentioned before, both the Diagonal plot and the Error-by-Shift plot struggle to display error variances when the number of methods to compare becomes relatively high. The "Bias-Box plot" is specifically devised for studying the distribution of the prediction errors in such cases. This kind of plot resorts to the well-known box plots to display the bias of the system, i.e., the signed error difference between the true class prevalence and the predicted class prevalence (see Equation 3.1 in Section 3.1.2). A box plot summarises a distribution by means of different graphical elements: the extremes of the box delimit the first and third quartiles of the distribution, a central line represents the median of the distribution while the position of a small triangle represents its average, the maximum and the minimum are represented by the whiskers at the top and at the bottom of the box, and finally the outliers appear above or below the corresponding whiskers. Figure 6.5 shows the Bias-Box plot of our experiments. This diagram reveals that PACC, SLD, and E(HDy)<sub>DS</sub> are the methods displaying the lowest bias overall, given that their boxes are the most squashed, and given that their whiskers are the shortest. Note how the reduction of variance with respect to the base members (HDy) that characterises the ensemble method (E(HDy)<sub>DS</sub>) is clearly perceivable in the last two boxes; this is in line with the observations reported by the inventors of this method (Pérez-Gállego et al., 2019). It is also interesting to note how the heuristic implemented in MS2 drastically reduces the variance displayed by MS.

As in the other cases, this plot is not exempt from limitations, though. Given that this plot uses distributions based on the bias (signed error difference), it is unavoidably tied to one class (acting as the positive class), and is thus more appealing for binary problems. Yet another limitation of the Bias-Box plot has to do with the fact that the distribution of the bias is computed on the whole experiment, which might involve (as it does indeed involve when APP is adopted) cases of severe distribution shift mixed up with cases of very low shift. The "Local-Bias-Box plot" can be of help in situations in which one prefers to break up the distribution into different pieces, each characterised by a different prevalence range, or by a different range of shift. In Figure 6.6 we show the Local-Bias-Box plot for our experiments, in which we bin the error bias in five ranges of true prevalence. This plot reveals how the "unadjusted" methods (e.g., CC, PCC) display positive bias for low prevalence values (thus showing a tendency to overestimate the true prevalence) and negative bias for high prevalence values (thus showing a tendency to underestimate the true prevalence). The "adjusted" versions (ACC and PACC), on the contrary, manage to reduce this effect, as witnessed by the fact that their box plots are centred at zero bias in those cases. This plot also reveals that MS tends to display a huge positive bias in the low-prevalence regime, while SVM(AE) displays a huge negative bias in the high-prevalence regime.
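The statistics that a Bias-Box plot and a Local-Bias-Box plot draw can be sketched as follows. We take the bias to be the estimated prevalence minus the true prevalence of the positive class, so that a positive value denotes overestimation, as in the discussion above; the quartile method, the equal-width binning, and the function names are assumptions of ours.

```python
from statistics import mean, median, quantiles


def box_summary(values):
    """Quartiles, median, and mean: the main elements drawn by a box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median(values), "q3": q3, "mean": mean(values)}


def local_bias_summaries(true_prevs, estim_prevs, n_bins=5):
    """Summarise the signed bias (estimated minus true prevalence of the
    positive class) separately for each range of true prevalence."""
    bins = {}
    for true_p, estim_p in zip(true_prevs, estim_prevs):
        bin_index = min(int(true_p * n_bins), n_bins - 1)
        bins.setdefault(bin_index, []).append(estim_p - true_p)
    return {b: box_summary(biases) for b, biases in sorted(bins.items())}


# Hypothetical results of an "unadjusted" quantifier on six test samples.
true_prevs = [0.05, 0.10, 0.15, 0.85, 0.90, 0.95]
estim_prevs = [0.12, 0.15, 0.22, 0.78, 0.84, 0.86]
summaries = local_bias_summaries(true_prevs, estim_prevs)
```

Calling `box_summary` on all the bias values at once gives the global variant, while the per-bin summaries expose the opposite signs of the bias in the low-prevalence and high-prevalence ranges, which a global summary would mix together.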

## **6.4 Related Tasks**

## *6.4.1 Links to Existing Tasks*

Quantification bears strong relations with *prevalence estimation from screening tests*, an important task in epidemiology (see Levy and Kass, 1970; Lew and Levy, 1989; Rahme and Joseph, 1998; Zhou et al., 2002); indeed, as already hinted in Section 6.1, the ACC quantification method discussed in Section 4.2.3 was used (in its binary form) for this task well before research in quantification was born. A screening test is a test that a patient undergoes in order to check if s/he has a given pathology. Tests are often imperfect, i.e., they may give rise to false positives (the patient is incorrectly diagnosed with the pathology) and false negatives (the test wrongly diagnoses the patient to be free from the pathology). Therefore, testing a patient is akin to classifying a data item, and using these tests for estimating the prevalence of the pathology in a given population is akin to performing quantification via classification. The main difference between this task and quantification is that a screening test typically has known and fairly constant recall (that epidemiologists call "sensitivity") and fallout (whose complement epidemiologists call "specificity"), while the same usually does not happen for a classifier.
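The correction that underlies this use of screening tests can be written down in a few lines. The sketch below assumes that sensitivity and specificity are known exactly, and solves for the true prevalence the equation that links it to the observed fraction of positive test outcomes; the function name and the numbers are ours and purely illustrative.

```python
def adjusted_prevalence(observed_positive_rate, sensitivity, specificity):
    """Correct the fraction of positive test outcomes for test imperfection.

    Solves observed = sensitivity * p + (1 - specificity) * (1 - p) for p,
    then clips the solution to the [0, 1] interval.
    """
    false_positive_rate = 1.0 - specificity  # i.e., the fallout
    denominator = sensitivity - false_positive_rate
    if denominator == 0:
        raise ValueError("the test is uninformative: sensitivity equals fallout")
    prevalence = (observed_positive_rate - false_positive_rate) / denominator
    return min(1.0, max(0.0, prevalence))


# Hypothetical test: 90% sensitivity, 95% specificity, 14% positive outcomes.
estimate = adjusted_prevalence(0.14, sensitivity=0.90, specificity=0.95)
```

When a classifier replaces the clinical test, the two rates are no longer known and must themselves be estimated from labelled data, which is the main difference from the epidemiological setting noted above.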

Quantification is also closely related to the problem of *density estimation* (Silverman, 1986), defined as the estimation, based on observed data, of the unknown probability density function of a given random variable; if the random variable is discrete, this means estimating, using observed data, the unknown distribution across the discrete set of events, i.e., across the classes. A classic, textbook example of density estimation is estimating the prevalence of white balls in a large urn containing white balls and black balls. However, quantification and density estimation are different in at least two respects. First, the above "urn" example assumes that, when we pick a ball from the urn, we can deterministically assess whether the ball is black or white, by simple visual inspection; in quantification we instead assume that assessing whether a given item belongs to the class is not a deterministic operation, and depends on subjective judgment. A second key difference is that the density estimation problem arises from the fact that in many applications it is practically impossible to assess class membership for each single individual (e.g., we do *not* want to inspect every single ball in the urn); however, in the case of quantification it is feasible to analyse every single item, since this is done automatically. (This is due to the fact that the items that are the object of quantification are digital objects, and any number of them can be processed given enough computational resources.) These differences clearly indicate the existence of a task different from density estimation, and characterised (a) by the need to assess class prevalence when class membership cannot be established deterministically, and (b) by the fact that *all* individuals contained in the sample can be analysed. These facts indicate altogether that our task is closely related to *classification*, a task in which facts (a) and (b) both hold. 
However, the goal of classification is different from the one we have set ourselves, since in classification we are interested in correctly estimating the true class of each single item.

A research area that might seem related to quantification is *collective classification* (CoC) (Sen et al., 2008), as in statistical relational learning. Similarly to quantification, in CoC the classification of instances is not viewed in isolation. However, CoC is radically different from quantification in that its focus is on improving the accuracy of classification by exploiting relationships between the items to classify (e.g., hypertextual documents that link to each other). For instance, in certain applications characterised by "homophily" (i.e., the tendency of individuals to associate with others who are similar to them) the fact that a data item has a certain label may provide additional evidence towards the fact that a related data item (say, one that is hyperlinked to the previous one) may have that label too. Differently from quantification, CoC assumes the existence of explicit relationships between the items to classify (which quantification does not), and is evaluated at the individual level, rather than at the aggregate level as quantification is.

Another related research task is *divergence approximation* (Sugiyama et al., 2013), which consists of estimating the divergence between two distributions. This seems, on the surface, akin to evaluating the accuracy of quantification. However, the main difference is that divergence approximation is performed when one does not have access to the two distributions, but only to finite samples from them. In other words, divergence approximation is useful when one is interested in the divergence of two distributions that should be estimated via the density estimation techniques previously discussed in this section: in this case, as Sugiyama et al. (2013) put it, "directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence." Evaluating the accuracy of quantification is thus different from divergence approximation because of the very same factors that make quantification and density estimation different.

Yet another related task is *learning with label proportions* (de Freitas and Kück, 2005; Quadrianto et al., 2009), which consists of learning to estimate the class labels of individual items when training data comes in the form of samples of such items with labels at the aggregate level. In other words, we do not know the class labels of individual training items, but we only know the class prevalence of samples of such items. This is the other way around with respect to quantification, where we need to predict labels at the aggregate level by learning from training data which are labelled at the individual level.

## *6.4.2 A Possible Variant of the Quantification Task*

Quantification, as defined in this book and in the literature that this book looks at, is (somewhat similarly to learning with label proportions) an unusual supervised learning task, in that the labels that we need to predict and the labels we use in order to train our predictors are not homologous, i.e., are not of the same type. In fact, in quantification we start from a training set of labelled items, and we need to predict the prevalence of the classes in a sample of unlabelled items. In other words, in the training data the labels (i) are attached to each individual item, and (ii) are drawn from the set *Y* of classes, while in the unlabelled data for which we need to issue predictions the labels (iii) must be attached to each pair consisting of a sample (i.e., a *set* of individual items) and a class, and (iv) are drawn from the [0,1] interval. This is unlike most other tasks in supervised learning (e.g., classification, regression), where the training items and the unlabelled items that need to be labelled are homologous, and where the labels of the training items and the labels to be attached to the unlabelled items are drawn from the same set.

In the future, one might want to investigate a variant of the quantification task in which the training data and the unlabelled data are homologous, and where the training labels and the labels to be predicted are homologous too. In this variant, the training data would thus consist not of a set of items labelled at the individual level, but of a set of labelled *samples*, where the labels are from the [0,1] interval and where no labels are attached to the individual items. The advantage of this formulation would be the possibility to use more standard tools from the arsenal of supervised learning machinery, since this would squarely be a standard regression task (albeit one in which the label $p_{\sigma}(y_i)$ for a pair $(\sigma, y_i)$ must be in [0,1] and the sum $\sum_{y_i \in Y} p_{\sigma}(y_i)$ must be equal to 1).<sup>4</sup>

The disadvantage of this formulation is that it may appear unnatural, since in many applications labelled data tend to come in the form of labelled individual items, rather than labelled samples. Still, applications in which there is no access to the individual labels but a label at the collective level is available do exist (as in the "learning with label proportions" task); for instance, in datasets of a medical nature the individual labels of training data might be masked off due to privacy considerations, but a label for the entire set might be available. In the future it might be interesting to investigate whether the advantages brought about by this formulation offset its disadvantages.
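The sample-level formulation discussed above can be illustrated with a deliberately small sketch. It assumes binary classes, a single scalar feature per item, and a bag summary given by the mean feature value; the names `fit_bag_regressor` and `predict_prevalences` are hypothetical, and the linear model is an illustrative choice rather than a method proposed in this book. The prediction is clipped to [0,1] and the negative-class prevalence is obtained by complement, so the two constraints on the predicted labels hold by construction.

```python
from statistics import fmean


def fit_bag_regressor(bags, prevalences):
    """Fit p ~ slope * mean(bag) + intercept by ordinary least squares.

    bags: list of samples, each a list of scalar item features
          (no labels are attached to the individual items).
    prevalences: positive-class prevalence of each sample, in [0, 1].
    """
    xs = [fmean(bag) for bag in bags]  # order-independent summary of each bag
    x_bar, p_bar = fmean(xs), fmean(prevalences)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all bags have the same mean; cannot fit a slope")
    sxp = sum((x - x_bar) * (p - p_bar) for x, p in zip(xs, prevalences))
    slope = sxp / sxx
    intercept = p_bar - slope * x_bar
    return slope, intercept


def predict_prevalences(model, bag):
    """Predict class prevalences for an unlabelled sample."""
    slope, intercept = model
    raw = slope * fmean(bag) + intercept
    pos = min(1.0, max(0.0, raw))  # enforce the [0, 1] constraint
    return {"neg": 1.0 - pos, "pos": pos}  # prevalences sum to 1


# Toy training data (assumed): three samples labelled only at the sample level.
train_bags = [[0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 2, 2]]
train_prevalences = [0.25, 0.75, 0.5]

model = fit_bag_regressor(train_bags, train_prevalences)
print(predict_prevalences(model, [2, 2, 2, 2, 0]))
```

In a realistic setting the bag summary and the regressor would both be far richer; the sketch is only meant to show that training and prediction operate on the same kind of object, namely a sample paired with a prevalence value.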

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

<sup>4</sup> Indeed, this formulation is clearly reminiscent of *multiple-instance regression* (Dooly et al., 2002; Ray and Page, 2001), a class of supervised learning techniques in which individual items (called *instances*) each have a vectorial representation and are grouped into sets (called *bags*). In multiple-instance regression, only the bags, and not the individual instances, have (real-valued) labels.

# **Chapter 7 The Road Ahead**

Quantification has seen a growing amount of work in the last 15 years, spawned by the realisation that there are many application settings in which the class labels to be attributed to individual items are not interesting *per se*, but are only stepping stones towards estimating prevalence values for the classes of interest. While research on learning to quantify has grown steadily from 2005 onwards, much more is still needed in order to reliably deliver accurate results across the entire range of application settings in which quantification can be employed.

What is the road ahead, then, for learning to quantify? While there is room for improvement in all the areas that this book has touched upon, from plain single-label quantification to the more complex ordinal quantification, from standard application contexts to more peculiar ones involving, say, streaming data or multilingual text, we think there are a few "burning topics" that are sorely in need of (and are likely to see) further work:

• *Quantification and deep learning*. While deep learning has had an enormous impact on AI and machine learning in general, and on classification in particular, there has not been much work on applying deep learning to quantification; so far, the only works in this department are Esuli et al. (2018), Sanyal et al. (2018), and Qi et al. (2020), discussed in Sections 4.2.12 and 4.3.1, respectively. While nowadays neural architectures naturally cater for variable-length sequential data, how to properly represent (or embed) unordered sets of elements is less clear, and it has been shown that simply arranging the elements in the set in an arbitrary order is problematic (Vinyals et al., 2016). Since unordered sets represent the primary form of interest in learning to quantify, it is likely that the study of permutation-invariant functions will become a central subject in future research on deep learning and quantification. Although some attempts have been made to represent unordered sets of inputs with deep learning architectures (Vinyals et al., 2016; Zaheer et al., 2017), more recent work suggests that this field is not yet well understood (Wagstaff et al., 2019).


One possible solution might consist of ranking the unlabelled items in decreasing order of the posterior probabilities generated by our classifier, and setting the classification threshold exactly at the value that justifies the class prevalence estimated by our quantifier; the items ranked above the threshold would thus constitute the "explanation" of the class prevalence returned by our quantifier. (The classifier, if generated with "explainable machine learning" technology, would in turn provide explanations for its individual classification decisions.) Still, this threshold would be different from the one that "our best possible classifier" would use, which makes this solution suboptimal. Research on quantification and explainability is thus sorely needed.
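The ranking-based idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: the quantifier's estimate is turned into an item count by rounding, the threshold is taken to be the posterior of the lowest-ranked selected item, and the function name `explain_prevalence` is hypothetical.

```python
def explain_prevalence(posteriors, estimated_prevalence):
    """Select the items that 'justify' an estimated positive-class prevalence.

    posteriors: posterior probability of the positive class for each item.
    estimated_prevalence: prevalence returned by the quantifier, in [0, 1].
    Returns the indices of the selected items (best first) and the implied
    classification threshold (None if no item is selected).
    """
    n = len(posteriors)
    k = round(estimated_prevalence * n)  # number of items to label positive
    # Rank item indices by decreasing posterior probability.
    ranked = sorted(range(n), key=lambda i: posteriors[i], reverse=True)
    selected = ranked[:k]
    threshold = posteriors[selected[-1]] if selected else None
    return selected, threshold


# Toy posteriors (assumed) for six unlabelled items.
posteriors = [0.95, 0.40, 0.80, 0.55, 0.10, 0.30]

selected, threshold = explain_prevalence(posteriors, 0.67)
print(selected, threshold)
```

With an estimated prevalence of 0.67, four of the six items are selected and the implied threshold falls at 0.40, i.e., below the 0.50 that a classifier would typically use on these posteriors. This is exactly the mismatch that makes the solution suboptimal.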

• *Transductive quantification.* A number of applications of quantification are transductive in nature, i.e., there is a single, finite set of unlabelled items for which we are interested in estimating class prevalence values, and this set is available at training time. For instance, in the "What do you think of onions in cheeseburgers?" scenario mentioned at the very beginning of Chapter 1, the market research expert may be interested in running this survey monthly, in order to track the evolution of customers' preferences (such a survey would be called a "tracker", in market research jargon). Alternatively, she might be interested in running the survey only once, in a one-off manner; in this case, the quantifier can be trained "on purpose" once the survey data are in, and the training process can take advantage of the fact that the data to quantify on are already available.

Transductive quantification is yet another context in which Vapnik's principle applies: estimating class prevalence values for a finite set of data is a less general (hence simpler) problem than generating a quantifier that generalises to the entire domain. So far, this aspect has been exploited by a few methods, e.g., Saerens et al. (2002)'s SLD method (Section 4.2.9) and Xue and Weiss (2009)'s CDE-Iterate method (Section 4.2.10); the fact that, for tasks other than quantification, transductive inference has been investigated quite frequently in recent years, and the abundance of contexts to which it can be applied, should encourage researchers to devote more effort to this area.
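To make the transductive flavour concrete, here is a minimal sketch of an expectation-maximisation procedure in the spirit of SLD: it re-estimates class prevalence directly on the posteriors of the one unlabelled set of interest. Section 4.2.9 remains the reference description; the function name `em_prevalence`, the stopping rule, and the toy data are assumptions made for illustration, and the training prevalences are assumed to be strictly positive.

```python
def em_prevalence(posteriors, train_prev, max_iter=1000, tol=1e-10):
    """Re-estimate class prevalences on a fixed set of unlabelled items.

    posteriors: one list of class posteriors per unlabelled item, as
                returned by a classifier trained under `train_prev`.
    train_prev: class prevalences in the training set (all > 0).
    """
    prev = list(train_prev)
    n_classes = len(prev)
    for _ in range(max_iter):
        totals = [0.0] * n_classes
        for post in posteriors:
            # E-step: rescale each posterior by the ratio between the
            # current prevalence estimate and the training prevalence,
            # then renormalise so the posteriors sum to 1.
            scaled = [post[c] * prev[c] / train_prev[c] for c in range(n_classes)]
            norm = sum(scaled)
            for c in range(n_classes):
                totals[c] += scaled[c] / norm
        # M-step: the new prevalence is the mean of the updated posteriors.
        new_prev = [t / len(posteriors) for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_prev, prev)) < tol
        prev = new_prev
        if converged:
            break
    return prev


# Toy unlabelled set (assumed): 70 items leaning positive, 30 leaning negative,
# scored by a classifier trained on perfectly balanced data.
posteriors = [[0.1, 0.9]] * 70 + [[0.9, 0.1]] * 30
print(em_prevalence(posteriors, [0.5, 0.5]))
```

On this toy set, averaging the posteriors directly gives a positive-class prevalence of 0.66, whereas the iteration moves the estimate further away from the training prevalence, to approximately 0.75. Nothing is learned beyond the set at hand, which is the sense in which the procedure is transductive.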

However, if there is one aspect of the quantification task that is even more sorely in need of advancement than the ones mentioned above, it is the awareness of its very existence on the part of its potential users. The large majority of application papers in which class prevalence values need to be estimated on sets of unlabelled data still use Classify and Count, essentially because the authors are unaware that better alternatives exist. Raising the awareness that class prevalence estimation is a problem that should be solved by its own specific techniques is a necessary step. This awareness is especially important since, with the advent of big data, more and more application contexts spring up in which we cannot afford to analyse the data at the individual level, and the aggregate level is what we have to be content with.


# **Bibliography**


© The Author(s) 2023

A. Esuli et al., *Learning to Quantify*, The Information Retrieval Series 47, https://doi.org/10.1007/978-3-031-20467-8


# **Index**

#### **Symbols**

*F*1, 3, 34 *Fβ* , 35 *X* → *Y* problem, 12 *Y* → *X* problem, 13

#### **A**

Absolute Error (AE), 37, 38, 46, 77 Mean, 38 Normalised (NAE), 38 Normalised Relative (NRAE), 40 Relative (RAE), 38 Adjusted Classify and Count (ACC), 10, 59–65, 71, 72, 77, 91, 97, 100, 117 Adjusted Regress and Sum (ARS), 89 Artificial-prevalence protocol, 49 Authorship attribution, 13 Axiomatic approach to evaluation, 35

#### **B**

Bias, 4, 14, 15 sample selection, 10, 15 Bias (evaluation measure), 37 Bray-Curtis dissimilarity (BCD), 38

#### **C**

Calibration, 20, 57, 59 City-block distance, 38 Classification, 1, 19 collective, 118 Classification error balancing, 77 Classification-quantification balancing, 77 Classifier, 5 fairness of a, 22 hard, 5 soft, 5 Classify and Count (CC), 3, 27–29, 58, 64, 72, 77–79, 92, 105 Classify and Total (CT), 92 Class distribution estimation, 2 Class prevalence, 1 Class prior estimation, 2, 19 Class probability re-estimation, 2 Codeframe, 5 Computational social science, 25 Concave function, 76 nested, 76 Confidence interval, 65, 82, 99, 101 Contingency table, 4 Cosine distance, 36 Cost-sensitive learning, 71 Counting, 2, 104 Cramér-von-Mises statistic, 48 Cross-lingual Structural Correspondence Learning, 90

#### **D**

Deep learning, 73, 76, 121 Density estimation, 118 Discordance ratio, 34 Distribution ordinal, 2 predicted, 2, 5, 36 probability, 2 true, 2, 5, 36 Distributional correspondence indexing, 90


Distribution *y*-Similarity (DyS), 67 Divergence, 3, 34, 67 Kullback-Leibler (KLD), 41, 67, 75–77, 87 Normalised Kullback-Leibler (NKLD), 41 Pearson (PD), 34 Divergence approximation, 119 Drift, 9

#### **E**

Earth Mover's Distance (EMD), 45 Estimator, 5 perfect, 4 perverse, 35 Explicit loss minimisation, 75 Extrinsic label, 13

#### **F**

False negative, 4, 5 False positive, 4, 5 Fisher consistency, 60, 71

#### **G**

Grossed-Up Total (GUT), 93

#### **H**

HDx, 81, 82 HDy, 66, 68, 72, 82, 100 Hellinger distance (HD), 66–68, 73, 81

#### **I**

Independently and identically distributed (IID), 8, 11, 15, 19, 59 Intrinsic label, 13

#### **K**

Kolmogorov-Smirnov Mixture Model (MM(KS)), 66

#### **L**

Learning with label proportions, 119 Link-Based Quantification, 91

#### **M**

Maximum Likelihood Prevalence Estimation (MLPE), 11, 56 Median Sweep (MS), 64, 72

Multi-objective measure, 77 Multivariate loss function, 75

#### **N**

Natural-Prevalence Protocol, 49 Nonlinear loss function, 75

#### **O**

Online stochastic optimisation, 76 Overestimation, 35

#### **P**

Political science, 25 PP-Area Mixture Model (MM(PP)), 66 Prevalence estimation from screening tests, 117 Prior, 1 Probabilistic Adjusted Classify and Count (PACC), 10, 61, 63–65, 100 Probabilistic Classify and Count (PCC), 58, 61, 63, 78, 87, 96, 104 Probability calibrated, 20 posterior, 5, 58, 88, 122 prior, 1 Proportional equality, 22

#### **Q**

Quantification, 1, 19 binary, 7, 34, 63 cost, 92 explainable, 122 multi-label, 7, 34 ordinal, 7, 45 regression, 7, 47, 88 sentiment, 24 single-label, 6–8, 34, 35, 49, 66, 87 stance, 26 text, 73, 78, 79 cross-lingual, 90 Quantification forests, 77 Quantification methods aggregative, 55, 57 non-aggregative, 55, 78, 122 Quantification trees, 76 Quantifier, 3

#### **R**

Ratio estimator (RE), 65


ReadMe, 26, 29, 78, 80, 82 Regress and Splice (RSp), 89 Regress and Sum (RSu), 88 Relative frequency, 1

#### **S**

Saerens-Latinne-Decaestecker algorithm (SLD), 69–71, 123 Sample, 5, 34 SemEval, 41, 45, 87, 104 Shared tasks, 41, 45, 87, 104 Shift concept, 11–13, 52 covariate, 11–13, 52, 100 dataset, 9–11, 19, 60, 71, 88, 89 distribution, 9–13, 19, 20, 28, 29, 56, 103, 104 prior probability, 13, 52, 60, 65, 71, 80 Smoothing, 41 Social sciences, 25 Squared Error (SE), 38, 47 Structured output learning, 75

#### **T**

Threshold at 0.50 (T50), 63 Topsøe distance, 67 Transduction, 69, 122 Transfer learning, 21, 91 Trivial predictor, 56 True negative, 5 True positive, 4, 5

#### **U**

Underestimation, 35

#### **V**

Vanilla accuracy, 56 Vapnik's principle, 78, 122, 123 Vaseršteĭn metric, 45

#### **W**

Word sense disambiguation, 21